Обсуждение: Failed transaction statistics to measure the logical replication progress

Поиск
Список
Период
Сортировка

Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
Hello, hackers


When the current HEAD fails during logical decoding, the failure
increments txns count in pg_stat_replication_slots - [1] and adds
the transaction size to the sum of bytes in the same repeatedly
on the publisher, until the problem is solved.
One of the good examples is duplication error on the subscriber side
and this applies to both streaming and spill cases as well.

This update prevents users from grasping the exact number and size of
successful and unsuccessful transactions. Accordingly, we need to
have new columns of failed transactions that will work to differentiate
both of them for all types, which means spill, streaming and normal
transactions. This will help users to measure the exact status of
logical replication.

Attached file is the POC patch for this.
Current design is to save failed stats data in the ReplicationSlot struct.
This is because after the error, I'm not able to access the ReorderBuffer object.
Thus, I chose the object where I can interact with at the ReplicationSlotRelease timing.

Below is one example that I can get on the publisher,
after the duplication error on the subscriber caused by insert is solved.

postgres=# select * from pg_stat_replication_slots;
-[ RECORD 1 ]-------+------
slot_name           | mysub
spill_txns          | 0
spill_count         | 0
spill_bytes         | 0
failed_spill_txns   | 0
failed_spill_bytes  | 0
stream_txns         | 0
stream_count        | 0
stream_bytes        | 0
failed_stream_txns  | 0
failed_stream_bytes | 0
total_txns          | 4
total_bytes         | 528
failed_total_txns   | 3
failed_total_bytes  | 396
stats_reset         |


Any ideas and comments are welcome.

[1] - https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-REPLICATION-SLOTS-VIEW


Best Regards,
    Takamichi Osumi


Вложения

Re: Failed transaction statistics to measure the logical replication progress

От
vignesh C
Дата:
On Thu, Jul 8, 2021 at 12:25 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> Hello, hackers
>
>
> When the current HEAD fails during logical decoding, the failure
> increments txns count in pg_stat_replication_slots - [1] and adds
> the transaction size to the sum of bytes in the same repeatedly
> on the publisher, until the problem is solved.
> One of the good examples is duplication error on the subscriber side
> and this applies to both streaming and spill cases as well.
>
> This update prevents users from grasping the exact number and size of
> successful and unsuccessful transactions. Accordingly, we need to
> have new columns of failed transactions that will work to differentiate
> both of them for all types, which means spill, streaming and normal
> transactions. This will help users to measure the exact status of
> logical replication.
>
> Attached file is the POC patch for this.
> Current design is to save failed stats data in the ReplicationSlot struct.
> This is because after the error, I'm not able to access the ReorderBuffer object.
> Thus, I chose the object where I can interact with at the ReplicationSlotRelease timing.
>
> Below is one example that I can get on the publisher,
> after the duplication error on the subscriber caused by insert is solved.
>
> postgres=# select * from pg_stat_replication_slots;
> -[ RECORD 1 ]-------+------
> slot_name           | mysub
> spill_txns          | 0
> spill_count         | 0
> spill_bytes         | 0
> failed_spill_txns   | 0
> failed_spill_bytes  | 0
> stream_txns         | 0
> stream_count        | 0
> stream_bytes        | 0
> failed_stream_txns  | 0
> failed_stream_bytes | 0
> total_txns          | 4
> total_bytes         | 528
> failed_total_txns   | 3
> failed_total_bytes  | 396
> stats_reset         |
>
>
> Any ideas and comments are welcome.

+1 for having logical replication failed statistics. Currently if
there is any transaction failure in the subscriber after sending the
decoded data to the subscriber like constraint violation, object not
exist, the statistics will include the failed decoded transaction info
and there is no way to identify the actual successful transaction
data. This patch will help in measuring the actual decoded transaction
data.

Regards,
Vignesh



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Tuesday, July 13, 2021 2:50 PM vignesh C <vignesh21@gmail.com> wrote:
> > When the current HEAD fails during logical decoding, the failure
> > increments txns count in pg_stat_replication_slots - [1] and adds the
> > transaction size to the sum of bytes in the same repeatedly on the
> > publisher, until the problem is solved.
> > One of the good examples is duplication error on the subscriber side
> > and this applies to both streaming and spill cases as well.
> >
> > This update prevents users from grasping the exact number and size of
> > successful and unsuccessful transactions. Accordingly, we need to have
> > new columns of failed transactions that will work to differentiate
> > both of them for all types, which means spill, streaming and normal
> > transactions. This will help users to measure the exact status of
> > logical replication.
> >
> > Attached file is the POC patch for this.
> > Current design is to save failed stats data in the ReplicationSlot struct.
> > This is because after the error, I'm not able to access the ReorderBuffer
> object.
> > Thus, I chose the object where I can interact with at the
> ReplicationSlotRelease timing.
> > Any ideas and comments are welcome.
...
> +1 for having logical replication failed statistics. Currently if
> there is any transaction failure in the subscriber after sending the decoded
> data to the subscriber like constraint violation, object not exist, the statistics
> will include the failed decoded transaction info and there is no way to identify
> the actual successful transaction data. This patch will help in measuring the
> actual decoded transaction data.
Yeah, we can apply this improvement to other error cases.
Thank you for sharing ideas to make this enhancement more persuasive.

Best Regards,
    Takamichi Osumi


Re: Failed transaction statistics to measure the logical replication progress

От
Ajin Cherian
Дата:
On Thu, Jul 8, 2021 at 4:55 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

> Attached file is the POC patch for this.
> Current design is to save failed stats data in the ReplicationSlot struct.
> This is because after the error, I'm not able to access the ReorderBuffer object.
> Thus, I chose the object where I can interact with at the ReplicationSlotRelease timing.

I think this is a good idea to capture the failed replication stats.
But I'm wondering how you are deciding if
the replication failed or not? Not all cases of ReplicationSLotRelease
are due to a failure. It could also be due to a planned dropping
of subscription or disable of subscription. I have not tested this but
won't the failed stats be updated in this case as well? Is that
correct?

regards,
Ajin Cherian
Fujitsu Australia



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Tuesday, July 27, 2021 3:59 PM Ajin Cherian <itsajin@gmail.com> wrote:
> On Thu, Jul 8, 2021 at 4:55 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> 
> > Attached file is the POC patch for this.
> > Current design is to save failed stats data in the ReplicationSlot struct.
> > This is because after the error, I'm not able to access the ReorderBuffer
> object.
> > Thus, I chose the object where I can interact with at the
> ReplicationSlotRelease timing.
> 
> I think this is a good idea to capture the failed replication stats.
> But I'm wondering how you are deciding if the replication failed or not? Not all
> cases of ReplicationSLotRelease are due to a failure. It could also be due to a
> planned dropping of subscription or disable of subscription. I have not tested
> this but won't the failed stats be updated in this case as well? Is that correct?
Yes, what you said is true. Currently, when I run DROP SUBSCRIPTION or
ALTER SUBSCRIPTION DISABLE, failed stats values are added
to pg_stat_replication_slots unintentionally, if they have some left values.
This is because all those commands, like the subscriber apply failure
by duplication error, have the publisher get 'X' message at ProcessRepliesIfAny()
and go into the path to call ReplicationSlotRelease().

Also, other opportunities like server stop call the same in the end,
which leads to a situation that after the server restart,
the value of failed stats catch up with the (successful) existing stats values.
Accordingly, I need to change the patch to adjust those situations.
Thank you.


Best Regards,
    Takamichi Osumi


Re: Failed transaction statistics to measure the logical replication progress

От
Masahiko Sawada
Дата:
On Thu, Jul 8, 2021 at 3:55 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> Hello, hackers
>
>
> When the current HEAD fails during logical decoding, the failure
> increments txns count in pg_stat_replication_slots - [1] and adds
> the transaction size to the sum of bytes in the same repeatedly
> on the publisher, until the problem is solved.
> One of the good examples is duplication error on the subscriber side
> and this applies to both streaming and spill cases as well.
>
> This update prevents users from grasping the exact number and size of
> successful and unsuccessful transactions. Accordingly, we need to
> have new columns of failed transactions that will work to differentiate
> both of them for all types, which means spill, streaming and normal
> transactions. This will help users to measure the exact status of
> logical replication.

Could you please elaborate on use cases of the proposed statistics?
For example, the current statistics on pg_replication_slots can be
used for tuning logical_decoding_work_mem as well as inferring the
total amount of bytes passed to the output plugin. How will the user
use those statistics?

Also, if we want the stats of successful transactions why don't we
show the stats of successful transactions in the view instead of ones
of failed transactions?

>
> Attached file is the POC patch for this.
> Current design is to save failed stats data in the ReplicationSlot struct.
> This is because after the error, I'm not able to access the ReorderBuffer object.
> Thus, I chose the object where I can interact with at the ReplicationSlotRelease timing.

When discussing the pg_stat_replication_slots view, there was an idea
to store the slot statistics on ReplicationSlot struct. But the idea
was rejected mainly because the struct is on the shared buffer[1]. If
we store those counts on ReplicationSlot struct it increases the usage
of shared memory. And those statistics are used only by logical slots
and don’t necessarily need to be shared among the server processes.
Moreover, if we want to add more statistics on the view in the future,
it further increases the usage of shared memory. If we want to track
the stats of successful transactions, I think it's easier to track
them on the subscriber side rather than the publisher side. We can
increase counters when applying [stream]commit/abort logical changes
on the subscriber.

Regards,

[1] https://www.postgresql.org/message-id/CAA4eK1Kuj%2B3G59hh3wu86f4mmpQLpah_mGv2-wfAPyn%2BzT%3DP4A%40mail.gmail.com

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Thursday, July 29, 2021 10:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Thu, Jul 8, 2021 at 3:55 PM osumi.takamichi@fujitsu.com 
> <osumi.takamichi@fujitsu.com> wrote:
> > When the current HEAD fails during logical decoding, the failure 
> > increments txns count in pg_stat_replication_slots - [1] and adds 
> > the transaction size to the sum of bytes in the same repeatedly on 
> > the publisher, until the problem is solved.
> > One of the good examples is duplication error on the subscriber side 
> > and this applies to both streaming and spill cases as well.
> >
> > This update prevents users from grasping the exact number and size 
> > of successful and unsuccessful transactions. Accordingly, we need to 
> > have new columns of failed transactions that will work to 
> > differentiate both of them for all types, which means spill, 
> > streaming and normal transactions. This will help users to measure 
> > the exact status of logical replication.
> 
> Could you please elaborate on use cases of the proposed statistics?
> For example, the current statistics on pg_replication_slots can be 
> used for tuning logical_decoding_work_mem as well as inferring the 
> total amount of bytes passed to the output plugin. How will the user use those statistics?
> 
> Also, if we want the stats of successful transactions why don't we 
> show the stats of successful transactions in the view instead of ones 
> of failed transactions?
It works to show the ratio of successful and unsuccessful transactions,
which should be helpful in terms of administrator perspective.
FYI, the POC patch added the columns where I prefixed 'failed' to those names.
But, substantially, it meant the ratio when user compared normal columns and
newly introduced columns by this POC in the pg_stat_replication_slots.


> > Attached file is the POC patch for this.
> > Current design is to save failed stats data in the ReplicationSlot struct.
> > This is because after the error, I'm not able to access the 
> > ReorderBuffer
> object.
> > Thus, I chose the object where I can interact with at the
> ReplicationSlotRelease timing.
> 
> When discussing the pg_stat_replication_slots view, there was an idea 
> to store the slot statistics on ReplicationSlot struct. But the idea 
> was rejected mainly because the struct is on the shared buffer[1]. If 
> we store those counts on ReplicationSlot struct it increases the usage 
> of shared memory. And those statistics are used only by logical slots 
> and don’t necessarily need to be shared among the server processes.
Yes, I was aware of this.
I was not sure if this design will be expected or not for the enhancement,
I thought of changing the design accordingly once the idea gets accepted by the community.

> Moreover, if we want to add more statistics on the view in the future, 
> it further increases the usage of shared memory. If we want to track 
> the stats of successful transactions, I think it's easier to track 
> them on the subscriber side rather than the publisher side. We can 
> increase counters when applying [stream]commit/abort logical changes on the subscriber.
It's true that to track the stats of successful and unsuccessful transactions on the *sub*
is easier than on the pub. After some survey, it turned out that I cannot distinguish
the protocol messages between the cases of any failure (e.g. duplication error on the sub)
from user intentional and successful operations(e.g. DROP SUBSCRIPTION and ALTER SUBSCRIPTION DISABLE) on the pub.

If we truly want to achieve this change on the publisher side,
protocol change requires in order to make above cases distinguishable,
now I feel that it is better to do this in the subscriber side. 

Accordingly, I'm thinking to have unsuccessful and successful stats on the sub side.
Sawada-san is now implementing a new view in [1].
Do you think that I should write a patch to introduce a new separate view
or write a patch to add more columns to the new view "pg_stat_subscription_errors" that is added at [1] ?


[1] - https://www.postgresql.org/message-id/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK%3D30xJfUVihNZDA%40mail.gmail.com


Best Regards,
    Takamichi Osumi


Re: Failed transaction statistics to measure the logical replication progress

От
Masahiko Sawada
Дата:
On Mon, Aug 2, 2021 at 2:52 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Thursday, July 29, 2021 10:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Thu, Jul 8, 2021 at 3:55 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > > When the current HEAD fails during logical decoding, the failure
> > > increments txns count in pg_stat_replication_slots - [1] and adds
> > > the transaction size to the sum of bytes in the same repeatedly on
> > > the publisher, until the problem is solved.
> > > One of the good examples is duplication error on the subscriber side
> > > and this applies to both streaming and spill cases as well.
> > >
> > > This update prevents users from grasping the exact number and size
> > > of successful and unsuccessful transactions. Accordingly, we need to
> > > have new columns of failed transactions that will work to
> > > differentiate both of them for all types, which means spill,
> > > streaming and normal transactions. This will help users to measure
> > > the exact status of logical replication.
> >
> > Could you please elaborate on use cases of the proposed statistics?
> > For example, the current statistics on pg_replication_slots can be
> > used for tuning logical_decoding_work_mem as well as inferring the
> > total amount of bytes passed to the output plugin. How will the user use those statistics?
> >
> > Also, if we want the stats of successful transactions why don't we
> > show the stats of successful transactions in the view instead of ones
> > of failed transactions?
> It works to show the ratio of successful and unsuccessful transactions,
> which should be helpful in terms of administrator perspective.
> FYI, the POC patch added the columns where I prefixed 'failed' to those names.
> But, substantially, it meant the ratio when user compared normal columns and
> newly introduced columns by this POC in the pg_stat_replication_slots.

What can the administrator use the ratio of successful and
unsuccessful logical replication transactions for? For example, IIUC
if a conflict happens on the subscriber as you mentioned, the
successful transaction ratio calculated by those statistics is getting
low, perhaps meaning the logical replication stopped. But it can be
checked also by checking pg_stat_replication view or
pg_replication_slots view (or error counts of
pg_stat_subscription_errors view I’m proposing[1]). Do you have other
use cases?

> > Moreover, if we want to add more statistics on the view in the future,
> > it further increases the usage of shared memory. If we want to track
> > the stats of successful transactions, I think it's easier to track
> > them on the subscriber side rather than the publisher side. We can
> > increase counters when applying [stream]commit/abort logical changes on the subscriber.
> It's true that to track the stats of successful and unsuccessful transactions on the *sub*
> is easier than on the pub. After some survey, it turned out that I cannot distinguish
> the protocol messages between the cases of any failure (e.g. duplication error on the sub)
> from user intentional and successful operations(e.g. DROP SUBSCRIPTION and ALTER SUBSCRIPTION DISABLE) on the pub.
>
> If we truly want to achieve this change on the publisher side,
> protocol change requires in order to make above cases distinguishable,
> now I feel that it is better to do this in the subscriber side.
>
> Accordingly, I'm thinking to have unsuccessful and successful stats on the sub side.
> Sawada-san is now implementing a new view in [1].
> Do you think that I should write a patch to introduce a new separate view
> or write a patch to add more columns to the new view "pg_stat_subscription_errors" that is added at [1] ?

pg_stat_subscriptions_errors view I'm proposing is a view showing the
details of error happening during logical replication. So I think a
separate view or pg_stat_subscription view would be a more appropriate
place.

Regards,

[1] https://www.postgresql.org/message-id/flat/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK=30xJfUVihNZDA@mail.gmail.com

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Mon, Aug 2, 2021 at 1:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Aug 2, 2021 at 2:52 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> >
> > Accordingly, I'm thinking to have unsuccessful and successful stats on the sub side.
> > Sawada-san is now implementing a new view in [1].
> > Do you think that I should write a patch to introduce a new separate view
> > or write a patch to add more columns to the new view "pg_stat_subscription_errors" that is added at [1] ?
>
> pg_stat_subscriptions_errors view I'm proposing is a view showing the
> details of error happening during logical replication. So I think a
> separate view or pg_stat_subscription view would be a more appropriate
> place.
>

+1 for having these stats in pg_stat_subscription. Do we want to add
two columns (xact_commit: number of transactions successfully applied
in this subscription, xact_rollback: number of transactions that have
been rolled back in this subscription) or do you guys have something
else in mind?

-- 
With Regards,
Amit Kapila.



Re: Failed transaction statistics to measure the logical replication progress

От
Masahiko Sawada
Дата:
On Tue, Aug 3, 2021 at 11:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Aug 2, 2021 at 1:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Aug 2, 2021 at 2:52 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > >
> > > Accordingly, I'm thinking to have unsuccessful and successful stats on the sub side.
> > > Sawada-san is now implementing a new view in [1].
> > > Do you think that I should write a patch to introduce a new separate view
> > > or write a patch to add more columns to the new view "pg_stat_subscription_errors" that is added at [1] ?
> >
> > pg_stat_subscriptions_errors view I'm proposing is a view showing the
> > details of error happening during logical replication. So I think a
> > separate view or pg_stat_subscription view would be a more appropriate
> > place.
> >
>
> +1 for having these stats in pg_stat_subscription. Do we want to add
> two columns (xact_commit: number of transactions successfully applied
> in this subscription, xact_rollback: number of transactions that have
> been rolled back in this subscription)

Sounds good. We might want to have separate counters for the number of
transactions failed due to an error and transactions rolled back by
stream_abort.

pg_stat_subscription currently shows logical replication worker stats
on the shared memory. I think xact_commit and xact_rollback should
survive beyond the server restarts, so it would be better to be
collected by the stats collector. My skipping transaction patch adds a
hash table of which entry represents a subscription stats. I guess we
can use the hash table so that one subscription stats entry has both
transaction stats and errors.

>  or do you guys have something else in mind?

Osumi-san was originally concerned that there is no way to grasp the
exact number and size of successful and unsuccessful transactions. The
above idea covers only the number of successful and unsuccessful
transactions but not the size. What do you think, Osumi-san?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Failed transaction statistics to measure the logical replication progress

От
vignesh C
Дата:
On Mon, Aug 2, 2021 at 1:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Aug 2, 2021 at 2:52 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Thursday, July 29, 2021 10:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > On Thu, Jul 8, 2021 at 3:55 PM osumi.takamichi@fujitsu.com
> > > <osumi.takamichi@fujitsu.com> wrote:
> > > > When the current HEAD fails during logical decoding, the failure
> > > > increments txns count in pg_stat_replication_slots - [1] and adds
> > > > the transaction size to the sum of bytes in the same repeatedly on
> > > > the publisher, until the problem is solved.
> > > > One of the good examples is duplication error on the subscriber side
> > > > and this applies to both streaming and spill cases as well.
> > > >
> > > > This update prevents users from grasping the exact number and size
> > > > of successful and unsuccessful transactions. Accordingly, we need to
> > > > have new columns of failed transactions that will work to
> > > > differentiate both of them for all types, which means spill,
> > > > streaming and normal transactions. This will help users to measure
> > > > the exact status of logical replication.
> > >
> > > Could you please elaborate on use cases of the proposed statistics?
> > > For example, the current statistics on pg_replication_slots can be
> > > used for tuning logical_decoding_work_mem as well as inferring the
> > > total amount of bytes passed to the output plugin. How will the user use those statistics?
> > >
> > > Also, if we want the stats of successful transactions why don't we
> > > show the stats of successful transactions in the view instead of ones
> > > of failed transactions?
> > It works to show the ratio of successful and unsuccessful transactions,
> > which should be helpful in terms of administrator perspective.
> > FYI, the POC patch added the columns where I prefixed 'failed' to those names.
> > But, substantially, it meant the ratio when user compared normal columns and
> > newly introduced columns by this POC in the pg_stat_replication_slots.
>
> What can the administrator use the ratio of successful and
> unsuccessful logical replication transactions for? For example, IIUC
> if a conflict happens on the subscriber as you mentioned, the
> successful transaction ratio calculated by those statistics is getting
> low, perhaps meaning the logical replication stopped. But it can be
> checked also by checking pg_stat_replication view or
> pg_replication_slots view (or error counts of
> pg_stat_subscription_errors view I’m proposing[1]). Do you have other
> use cases?

We could also include failed_data_size, this will help us to identify
the actual bandwidth consumed by the failed transaction. It will help
the DBA's to understand the network consumption in a better way.
Currently only total transaction and total data will be available but
when there is a failure, the failed transaction data will be sent
repeatedly, if the DBA does not solve the actual cause of failure,
there can be significant consumption of the network due to failure
transaction being sent repeatedly. DBA will not be able to understand
why there is so much network bandwidth consumption. If we give the
failed transaction information, the DBA might not get alarmed in this
case and understand that the network consumption is genuine. Also it
will help monitoring tools to project this value.

Regards,
Vignesh



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Tuesday, August 3, 2021 2:29 PM  Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Aug 3, 2021 at 11:47 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Mon, Aug 2, 2021 at 1:13 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, Aug 2, 2021 at 2:52 PM osumi.takamichi@fujitsu.com
> > > <osumi.takamichi@fujitsu.com> wrote:
> > > >
> > > >
> > > > Accordingly, I'm thinking to have unsuccessful and successful stats on the
> sub side.
> > > > Sawada-san is now implementing a new view in [1].
> > > > Do you think that I should write a patch to introduce a new
> > > > separate view or write a patch to add more columns to the new view
> "pg_stat_subscription_errors" that is added at [1] ?
> > >
> > > pg_stat_subscriptions_errors view I'm proposing is a view showing
> > > the details of error happening during logical replication. So I
> > > think a separate view or pg_stat_subscription view would be a more
> > > appropriate place.
> > >
> >
> > +1 for having these stats in pg_stat_subscription. Do we want to add
> > two columns (xact_commit: number of transactions successfully applied
> > in this subscription, xact_rollback: number of transactions that have
> > been rolled back in this subscription)
> 
> Sounds good. We might want to have separate counters for the number of
> transactions failed due to an error and transactions rolled back by stream_abort.
Okay. I wanna make those separate as well for this feature.


> pg_stat_subscription currently shows logical replication worker stats on the
> shared memory. I think xact_commit and xact_rollback should survive beyond
> the server restarts, so it would be better to be collected by the stats collector.
> My skipping transaction patch adds a hash table of which entry represents a
> subscription stats. I guess we can use the hash table so that one subscription
> stats entry has both transaction stats and errors.
> 
> >  or do you guys have something else in mind?
> 
> Osumi-san was originally concerned that there is no way to grasp the exact
> number and size of successful and unsuccessful transactions. The above idea
> covers only the number of successful and unsuccessful transactions but not the
> size. What do you think, Osumi-san?
Yeah, I think tracking sizes of failed transactions and roll-backed transactions
is helpful to identify the genuine network consumption,
as mentioned by Vignesh in another mail.
I'd like to include those also and post a patch for this.

Best Regards,
    Takamichi Osumi


Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Tue, Aug 3, 2021 at 10:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Aug 3, 2021 at 11:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Aug 2, 2021 at 1:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, Aug 2, 2021 at 2:52 PM osumi.takamichi@fujitsu.com
> > > <osumi.takamichi@fujitsu.com> wrote:
> > > >
> > > >
> > > > Accordingly, I'm thinking to have unsuccessful and successful stats on the sub side.
> > > > Sawada-san is now implementing a new view in [1].
> > > > Do you think that I should write a patch to introduce a new separate view
> > > > or write a patch to add more columns to the new view "pg_stat_subscription_errors" that is added at [1] ?
> > >
> > > pg_stat_subscriptions_errors view I'm proposing is a view showing the
> > > details of error happening during logical replication. So I think a
> > > separate view or pg_stat_subscription view would be a more appropriate
> > > place.
> > >
> >
> > +1 for having these stats in pg_stat_subscription. Do we want to add
> > two columns (xact_commit: number of transactions successfully applied
> > in this subscription, xact_rollback: number of transactions that have
> > been rolled back in this subscription)
>
> Sounds good. We might want to have separate counters for the number of
> transactions failed due to an error and transactions rolled back by
> stream_abort.
>

I was trying to think based on similar counters in pg_stat_database
but if you think there is a value in showing errored and actual
rollbacked transactions separately then we can do that but how do you
think one can make use of it?

-- 
With Regards,
Amit Kapila.



Re: Failed transaction statistics to measure the logical replication progress

От
Masahiko Sawada
Дата:
On Tue, Aug 3, 2021 at 6:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Aug 3, 2021 at 10:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Aug 3, 2021 at 11:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Aug 2, 2021 at 1:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Mon, Aug 2, 2021 at 2:52 PM osumi.takamichi@fujitsu.com
> > > > <osumi.takamichi@fujitsu.com> wrote:
> > > > >
> > > > >
> > > > > Accordingly, I'm thinking to have unsuccessful and successful stats on the sub side.
> > > > > Sawada-san is now implementing a new view in [1].
> > > > > Do you think that I should write a patch to introduce a new separate view
> > > > > or write a patch to add more columns to the new view "pg_stat_subscription_errors" that is added at [1] ?
> > > >
> > > > pg_stat_subscriptions_errors view I'm proposing is a view showing the
> > > > details of error happening during logical replication. So I think a
> > > > separate view or pg_stat_subscription view would be a more appropriate
> > > > place.
> > > >
> > >
> > > +1 for having these stats in pg_stat_subscription. Do we want to add
> > > two columns (xact_commit: number of transactions successfully applied
> > > in this subscription, xact_rollback: number of transactions that have
> > > been rolled back in this subscription)
> >
> > Sounds good. We might want to have separate counters for the number of
> > transactions failed due to an error and transactions rolled back by
> > stream_abort.
> >
>
> I was trying to think based on similar counters in pg_stat_database
> but if you think there is a value in showing errored and actual
> rollbacked transactions separately then we can do that but how do you
> think one can make use of it?

I'm concerned that the value that includes both errored and actual
rollbacked transactions doesn't make sense in practice since the
number of errored transactions can easily get increased once a
conflict happens. IMO the errored transaction doesn’t not necessarily
necessary since the number of (successive) errors that happened on the
subscription is tracked by pg_stat_subscription_errors view. So it
might be enough to have actual rollbacked transactions. If this value
is high, it's likely that many rollbacked transactions are streamed,
unnecessarily consuming network bandwidth. So the user might want to
increase logical_decoding_work_mem to suppress transactions getting
streamed.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Wed, Aug 4, 2021 at 6:19 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Aug 3, 2021 at 6:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I was trying to think based on similar counters in pg_stat_database
> > but if you think there is a value in showing errored and actual
> > rollbacked transactions separately then we can do that but how do you
> > think one can make use of it?
>
> I'm concerned that the value that includes both errored and actual
> rollbacked transactions doesn't make sense in practice since the
> number of errored transactions can easily get increased once a
> conflict happens. IMO the errored transaction doesn’t not necessarily
> necessary since the number of (successive) errors that happened on the
> subscription is tracked by pg_stat_subscription_errors view.
>

It sounds awkward to display two of the xact (xact_commit,
xact_rollback) counters in one view and then the other similar counter
(xact_error or something like that) in another view. Isn't it better
to display all of them together possibly in pg_stat_subscription? I
guess it might be a bit tricky to track counters for tablesync workers
separately but we can track them in the corresponding subscription.

> So it
> might be enough to have actual rollbacked transactions. If this value
> is high, it's likely that many rollbacked transactions are streamed,
> unnecessarily consuming network bandwidth. So the user might want to
> increase logical_decoding_work_mem to suppress transactions getting
> streamed.
>

Okay, we might want to probably document such a use case.

--
With Regards,
Amit Kapila.



Re: Failed transaction statistics to measure the logical replication progress

От
Masahiko Sawada
Дата:
On Wed, Aug 4, 2021 at 12:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Aug 4, 2021 at 6:19 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Aug 3, 2021 at 6:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > I was trying to think based on similar counters in pg_stat_database
> > > but if you think there is a value in showing errored and actual
> > > rollbacked transactions separately then we can do that but how do you
> > > think one can make use of it?
> >
> > I'm concerned that the value that includes both errored and actual
> > rollbacked transactions doesn't make sense in practice since the
> > number of errored transactions can easily get increased once a
> > conflict happens. IMO the errored transaction doesn’t not necessarily
> > necessary since the number of (successive) errors that happened on the
> > subscription is tracked by pg_stat_subscription_errors view.
> >
>
> It sounds awkward to display two of the xact (xact_commit,
> xact_rollback) counters in one view and then the other similar counter
> (xact_error or something like that) in another view. Isn't it better
> to display all of them together possibly in pg_stat_subscription? I
> guess it might be a bit tricky to track counters for tablesync workers
> separately but we can track them in the corresponding subscription.

I meant that the number of rolled back transactions due to an error
seems not to be necessary since pg_stat_subscription_errors has a
similar value. So what I imagined is that we have xact_commit and
xact_rollback (counting only actual rollbacked transaction) counters
in pg_stat_subscription.  What do you think of this idea? Or did you
mean the number of errored transactions also has value so should be
included in pg_stat_subscription along with xact_commit and
xact_rollback?

Originally I thought your proposal of having the number of rollback
transactions includes both errored transactions and actual rolled back
transactions so my point was it's better to separate them and include
only the actual rolled-back transaction.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Wed, Aug 4, 2021 at 12:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Aug 4, 2021 at 12:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Aug 4, 2021 at 6:19 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Tue, Aug 3, 2021 at 6:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > I was trying to think based on similar counters in pg_stat_database
> > > > but if you think there is a value in showing errored and actual
> > > > rollbacked transactions separately then we can do that but how do you
> > > > think one can make use of it?
> > >
> > > I'm concerned that the value that includes both errored and actual
> > > rollbacked transactions doesn't make sense in practice since the
> > > number of errored transactions can easily get increased once a
> > > conflict happens. IMO the errored transaction doesn’t not necessarily
> > > necessary since the number of (successive) errors that happened on the
> > > subscription is tracked by pg_stat_subscription_errors view.
> > >
> >
> > It sounds awkward to display two of the xact (xact_commit,
> > xact_rollback) counters in one view and then the other similar counter
> > (xact_error or something like that) in another view. Isn't it better
> > to display all of them together possibly in pg_stat_subscription? I
> > guess it might be a bit tricky to track counters for tablesync workers
> > separately but we can track them in the corresponding subscription.
>
> I meant that the number of rolled back transactions due to an error
> seems not to be necessary since pg_stat_subscription_errors has a
> similar value.
>

I got that point.

> So what I imagined is that we have xact_commit and
> xact_rollback (counting only actual rollbacked transaction) counters
> in pg_stat_subscription.  What do you think of this idea? Or did you
> mean the number of errored transactions also has value so should be
> included in pg_stat_subscription along with xact_commit and
> xact_rollback?
>

I meant the later one. I think it might be better to display all three
(xact_commit, xact_abort, xact_error) in one place. Earlier it made
sense to me to display it in pg_stat_subscription_errors but not sure
after this proposal. Won't it be better for users to see all the
counters in one view?

> Originally I thought your proposal of having the number of rollback
> transactions includes both errored transactions and actual rolled back
> transactions so my point was it's better to separate them and include
> only the actual rolled-back transaction.
>

I am fine with your proposal to separate the actual rollback and error
transactions counter but I thought it would be better to display them
in one view.

--
With Regards,
Amit Kapila.



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
Hi


I've made a new patch to extend pg_stat_subscription
as suggested in [1] to have columns
xact_commit, xact_error and independent xact_abort mentioned in [2].
Also, during discussion, we touched a topic if we should
include data sizes for each column above and concluded that it's
better to have ones. Accordingly, I've implemented corresponding
columns to show the data sizes as well.

Note that this patch depends on v12 patchset of apply error callback
provided in [3]. Therfore, applying the patchset first is required,
if you would like to test my patch.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BtOV-%2BssGjj1zq%2BnAL8a9LfPsxbtyupZGvZ0U7nV0A7g%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1KMT8biciVqTBoZ9gYV-Gf297JFeNhJaxZNmFrZL8m2jA%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAD21AoA5HrhXqwbYLpSobGzV6rWoJmH3-NB9J3YarKDwARBj4w%40mail.gmail.com


Best Regards,
    Takamichi Osumi


Вложения

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
Hi

On Thursday, September 2, 2021 11:23 AM I wrote:
> I've made a new patch to extend pg_stat_subscription as suggested in [1] to
> have columns xact_commit, xact_error and independent xact_abort mentioned
> in [2].
> Also, during discussion, we touched a topic if we should include data sizes for
> each column above and concluded that it's better to have ones. Accordingly,
> I've implemented corresponding columns to show the data sizes as well.
I've updated my previous patch of subscriber's stats.
The main change of this version is
the bytes calculation that are exported by pg_stat_subscription.
I prepared a new function which is equivalent to ReorderBufferChangeSize
on the subscriber side to calculate the resource consumption.
It's because this is in line with actual process of the subscriber.

Other changes are minor and cosmetic.

Also, this patch is based on the v12 patch-set of skip xid,
as described in my previous email.
Note that you need to use past commit-id if you want to apply v12 set.
I'll update mine as well when the latest patch-set v13 is shared on hackers.

Best Regards,
    Takamichi Osumi

Вложения

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
Hi

On Thursday, September 9, 2021 5:04 PM I wrote:
> Also, this patch is based on the v12 patch-set of skip xid, as described in my
> previous email.
> Note that you need to use past commit-id if you want to apply v12 set.
> I'll update mine as well when the latest patch-set v13 is shared on hackers.
Rebased, using v13 patch-set in [1].


[1 ] - https://www.postgresql.org/message-id/CAD21AoBUXM4ODfPFa%3Dh7M6vSKwOKysapUce3tS4rs9mfVMm%2BcQ%40mail.gmail.com


Best Regards,
    Takamichi Osumi


Вложения

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
Hello


Just conducted some cosmetic changes
and rebased my patch, using v14 patch-set in [1].

[1] - https://www.postgresql.org/message-id/CAD21AoCO_ZYWZEBw7ziiYoX7Zm1P0L9%3Dd7Jj9YsGEGsT9o6wmw%40mail.gmail.com


Best Regards,
    Takamichi Osumi

Вложения

Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Wed, Sep 22, 2021 at 10:10 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> Just conducted some cosmetic changes
> and rebased my patch, using v14 patch-set in [1].
>

IIUC, this proposal will allow new xact stats for subscriptions via
pg_stat_subscription. One thing that is not clear to me in this patch
is that why you choose a different way to store these stats than the
existing stats in that view? AFAICS, the other existing stats are
stored in-memory in LogicalRepWorker whereas these new stats are
stored/fetched via stats collector means these will persist. Isn't it
better to be consistent here? I am not sure which is a more
appropriate way to store these stats and I would like to hear your and
other's thoughts on that matter but it appears a bit awkward to me
that some of the stats in the same view are persistent and others are
in-memory.

-- 
With Regards,
Amit Kapila.



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Monday, September 27, 2021 1:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Sep 22, 2021 at 10:10 AM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > Just conducted some cosmetic changes
> > and rebased my patch, using v14 patch-set in [1].
> >
> 
> IIUC, this proposal will allow new xact stats for subscriptions via
> pg_stat_subscription. One thing that is not clear to me in this patch is that why
> you choose a different way to store these stats than the existing stats in that
> view? AFAICS, the other existing stats are stored in-memory in
> LogicalRepWorker whereas these new stats are stored/fetched via stats
> collector means these will persist. Isn't it better to be consistent here? I am not
> sure which is a more appropriate way to store these stats and I would like to
> hear your and other's thoughts on that matter but it appears a bit awkward to
> me that some of the stats in the same view are persistent and others are
> in-memory.
Yeah, existing stats values of pg_stat_subscription are in-memory.
I thought xact stats should survive over the restart,
to summarize and show all accumulative transaction values
on one subscription for user. But, your pointing out is reasonable,
mixing two types can be awkward and lack of consistency.

Then, if, we proceed in this direction,
the place to implement those stats
would be on the LogicalRepWorker struct, instead ?

On one hand, what confuses me is that
in another thread of feature to skip xid,
I wondered if Sawada-san has started to take
those xact stats into account (probably in his patch-set),
because the stats values this thread is taking care of are listed up in the thread.
If that's true, its thread and this thread are getting really close.
So, IIUC, we have to discuss this point as well.

Best Regards,
    Takamichi Osumi


Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Tue, Sep 28, 2021 at 7:25 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Monday, September 27, 2021 1:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Wed, Sep 22, 2021 at 10:10 AM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > Just conducted some cosmetic changes
> > > and rebased my patch, using v14 patch-set in [1].
> > >
> >
> > IIUC, this proposal will allow new xact stats for subscriptions via
> > pg_stat_subscription. One thing that is not clear to me in this patch is that why
> > you choose a different way to store these stats than the existing stats in that
> > view? AFAICS, the other existing stats are stored in-memory in
> > LogicalRepWorker whereas these new stats are stored/fetched via stats
> > collector means these will persist. Isn't it better to be consistent here? I am not
> > sure which is a more appropriate way to store these stats and I would like to
> > hear your and other's thoughts on that matter but it appears a bit awkward to
> > me that some of the stats in the same view are persistent and others are
> > in-memory.
> Yeah, existing stats values of pg_stat_subscription are in-memory.
> I thought xact stats should survive over the restart,
> to summarize and show all accumulative transaction values
> on one subscription for user. But, your pointing out is reasonable,
> mixing two types can be awkward and lack of consistency.
>
> Then, if, we proceed in this direction,
> the place to implement those stats
> would be on the LogicalRepWorker struct, instead ?
>

Or, we can make existing stats persistent and then add these stats on
top of it. Sawada-San, do you have any thoughts on this matter?

> On one hand, what confuses me is that
> in another thread of feature to skip xid,
> I wondered if Sawada-san has started to take
> those xact stats into account (probably in his patch-set),
> because the stats values this thread is taking care of are listed up in the thread.
>

I don't think the Skip Xid patch is going to take care of these
additional stats but Sawada-San can confirm.

-- 
With Regards,
Amit Kapila.



Re: Failed transaction statistics to measure the logical replication progress

От
Masahiko Sawada
Дата:
On Tue, Sep 28, 2021 at 1:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 28, 2021 at 7:25 AM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Monday, September 27, 2021 1:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > On Wed, Sep 22, 2021 at 10:10 AM osumi.takamichi@fujitsu.com
> > > <osumi.takamichi@fujitsu.com> wrote:
> > > >
> > > > Just conducted some cosmetic changes
> > > > and rebased my patch, using v14 patch-set in [1].
> > > >
> > >
> > > IIUC, this proposal will allow new xact stats for subscriptions via
> > > pg_stat_subscription. One thing that is not clear to me in this patch is that why
> > > you choose a different way to store these stats than the existing stats in that
> > > view? AFAICS, the other existing stats are stored in-memory in
> > > LogicalRepWorker whereas these new stats are stored/fetched via stats
> > > collector means these will persist. Isn't it better to be consistent here? I am not
> > > sure which is a more appropriate way to store these stats and I would like to
> > > hear your and other's thoughts on that matter but it appears a bit awkward to
> > > me that some of the stats in the same view are persistent and others are
> > > in-memory.
> > Yeah, existing stats values of pg_stat_subscription are in-memory.
> > I thought xact stats should survive over the restart,
> > to summarize and show all accumulative transaction values
> > on one subscription for user. But, your pointing out is reasonable,
> > mixing two types can be awkward and lack of consistency.

I remembered that we have discussed a similar thing when discussing
pg_stat_replication_slots view[1]. The slot statistics such as
spill_txn, spill_count, and spill_bytes were originally shown in
pg_stat_replication, mixing cumulative counters and dynamic counters.
But we concluded to separate these cumulative values from
pg_stat_replication view and introduce a new pg_stat_replication_slots
view.

> >
> > Then, if, we proceed in this direction,
> > the place to implement those stats
> > would be on the LogicalRepWorker struct, instead ?
> >
>
> Or, we can make existing stats persistent and then add these stats on
> top of it. Sawada-San, do you have any thoughts on this matter?

I think that making existing stats including received_lsn and
last_msg_receipt_time persistent by using stats collector could cause
massive reporting messages. We can report these messages with a
certain interval to reduce the amount of messages but we will end up
seeing old stats on the view. Another idea could be to have a separate
view, say pg_stat_subscription_xact but I'm not sure it's a better
idea.

>
> > On one hand, what confuses me is that
> > in another thread of feature to skip xid,
> > I wondered if Sawada-san has started to take
> > those xact stats into account (probably in his patch-set),
> > because the stats values this thread is taking care of are listed up in the thread.
> >
>
> I don't think the Skip Xid patch is going to take care of these
> additional stats but Sawada-San can confirm.

Yes, I don't take care of these additional stats discussed on this
thread. The reason why I recently mentioned these statistics on the
skip_xid thread is that since I thought that this patch will be built
on top of the subscription error reporting patch that I'm proposing I
discussed the extensibility of my patch.

Regards,

[1] https://www.postgresql.org/message-id/CABUevEwayASgA2GHAQx%3DVtCp6OQh5PVfem6JDQ97gFHka%3D6n1w%40mail.gmail.com

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Tue, Sep 28, 2021 at 11:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Sep 28, 2021 at 1:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> > > Then, if, we proceed in this direction,
> > > the place to implement those stats
> > > would be on the LogicalRepWorker struct, instead ?
> > >
> >
> > Or, we can make existing stats persistent and then add these stats on
> > top of it. Sawada-San, do you have any thoughts on this matter?
>
> I think that making existing stats including received_lsn and
> last_msg_receipt_time persistent by using stats collector could cause
> massive reporting messages. We can report these messages with a
> certain interval to reduce the amount of messages but we will end up
> seeing old stats on the view.
>

Can't we keep the current and new stats both in-memory and persist on
disk? So, the persistent stats data will be used to fill the in-memory
counters after restarting of workers, otherwise, we will always refer
to in-memory values.

> Another idea could be to have a separate
> view, say pg_stat_subscription_xact but I'm not sure it's a better
> idea.
>

Yeah, that is another idea but I am afraid that having three different
views for subscription stats will be too much. I think it would be
better if we can display these additional stats via the existing view
pg_stat_subscription or the new view pg_stat_subscription_errors (or
whatever name we want to give it).

-- 
With Regards,
Amit Kapila.



Re: Failed transaction statistics to measure the logical replication progress

От
Greg Nancarrow
Дата:
On Tue, Sep 28, 2021 at 8:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Another idea could be to have a separate
> > view, say pg_stat_subscription_xact but I'm not sure it's a better
> > idea.
> >
>
> Yeah, that is another idea but I am afraid that having three different
> views for subscription stats will be too much. I think it would be
> better if we can display these additional stats via the existing view
> pg_stat_subscription or the new view pg_stat_subscription_errors (or
> whatever name we want to give it).
>


It seems that we have come full-circle in the discussion of how these
new stats could be best added.
Other than adding a new "pg_stats_subscription_xact" view (which Amit
thought would result in too many views) I don't see any clear solution
here, because the new xact-related cumulative stats don't really
consistently integrate with the existing in-memory stats in
pg_stat_subscription, and IMHO the new stats wouldn't integrate well
into the existing error-related stats in the
"pg_stat_subscription_errors" view (even if its name was changed, that
view in any case maintains lowel-level error details and the way error
entries are removed doesn't allow the "success" stats to be maintained
for the subscription and doesn't fit well with the added
"pg_stat_reset_subscription_error" function either).
If we can't really add another view like "pg_stats_subscription_xact",
it seems we need to find some way the new stats fit more consistently
into the existing "pg_stat_subscription" view.

Regards,
Greg Nancarrow
Fujitsu Australia



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
Hi,


Thank you, Amit-san and Sawada-san for the discussion.
On Tuesday, September 28, 2021 7:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Another idea could be to have a separate view, say
> > pg_stat_subscription_xact but I'm not sure it's a better idea.
> >
> 
> Yeah, that is another idea but I am afraid that having three different
> views for subscription stats will be too much. I think it would be
> better if we can display these additional stats via the existing view
> pg_stat_subscription or the new view pg_stat_subscription_errors (or
> whatever name we want to give it).
pg_stat_subscription_errors specializes in showing an error record.
So, it would be awkward to combine it with other normal xact stats.


> > > > Then, if, we proceed in this direction, the place to implement
> > > > those stats would be on the LogicalRepWorker struct, instead ?
> > > >
> > >
> > > Or, we can make existing stats persistent and then add these stats
> > > on top of it. Sawada-San, do you have any thoughts on this matter?
> >
> > I think that making existing stats including received_lsn and
> > last_msg_receipt_time persistent by using stats collector could cause
> > massive reporting messages. We can report these messages with a
> > certain interval to reduce the amount of messages but we will end up
> > seeing old stats on the view.
> >
> 
> Can't we keep the current and new stats both in-memory and persist on disk?
> So, the persistent stats data will be used to fill the in-memory counters after
> restarting of workers, otherwise, we will always refer to in-memory values.
I felt this isn't impossible.
When we have to update the values of the xact stats is
the end of message apply for COMMIT, COMMIT PREPARED, STREAM_ABORT and etc
or the time when an error happens during apply. Then, if we want,
we can update xact stats values at such moments accordingly.
I'm thinking that we will have a hash table whose key is a pair of subid + relid
and entry is a proposed stats structure and update the entry,
depending on the above timings.

Here, one thing a bit unclear to me is
whether we should move existing stats of pg_stat_subscription
(such as last_lsn and reply_lsn) to the hash entry or not.
Now, in pg_stat_get_subscription() for pg_stat_subscription view,
current stats values are referenced directly from (a copy of)
existing LogicalRepCtx->workers. I felt that we need to avoid
a situation that some existing data are fetched from LogicalRepWorker
and other new xact stats are from the hash in the function,
in order to keeping the alignment of the function. Was this correct ?

Another thing we need to talk is where we put a new file
of contents of pg_stat_subscription. I'm thinking that
it's pg_logical/, because above idea does not interact with
stats collector any more.

Let me know if I miss something.


Best Regards,
    Takamichi Osumi


RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Wednesday, September 29, 2021 2:59 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> On Tue, Sep 28, 2021 at 8:04 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > > Another idea could be to have a separate view, say
> > > pg_stat_subscription_xact but I'm not sure it's a better idea.
> > >
> >
> > Yeah, that is another idea but I am afraid that having three different
> > views for subscription stats will be too much. I think it would be
> > better if we can display these additional stats via the existing view
> > pg_stat_subscription or the new view pg_stat_subscription_errors (or
> > whatever name we want to give it).
> 
> It seems that we have come full-circle in the discussion of how these new stats
> could be best added.
> Other than adding a new "pg_stats_subscription_xact" view (which Amit
> thought would result in too many views) I don't see any clear solution here,
> because the new xact-related cumulative stats don't really consistently
> integrate with the existing in-memory stats in pg_stat_subscription, and IMHO
> the new stats wouldn't integrate well into the existing error-related stats in the
> "pg_stat_subscription_errors" view (even if its name was changed, that view in
> any case maintains lowel-level error details and the way error entries are
> removed doesn't allow the "success" stats to be maintained for the
> subscription and doesn't fit well with the added
> "pg_stat_reset_subscription_error" function either).
> If we can't really add another view like "pg_stats_subscription_xact", it seems
> we need to find some way the new stats fit more consistently into the existing
> "pg_stat_subscription" view.
Yeah, I agree with your conclusion.
I feel we cannot avoid changing some base of pg_stat_subscription
for the xact stats.



Best Regards,
    Takamichi Osumi


Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Wed, Sep 29, 2021 at 11:35 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> Hi,
>
>
> Thank you, Amit-san and Sawada-san for the discussion.
> On Tuesday, September 28, 2021 7:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > Another idea could be to have a separate view, say
> > > pg_stat_subscription_xact but I'm not sure it's a better idea.
> > >
> >
> > Yeah, that is another idea but I am afraid that having three different
> > views for subscription stats will be too much. I think it would be
> > better if we can display these additional stats via the existing view
> > pg_stat_subscription or the new view pg_stat_subscription_errors (or
> > whatever name we want to give it).
> pg_stat_subscription_errors specializes in showing an error record.
> So, it would be awkward to combine it with other normal xact stats.
>
>
> > > > > Then, if, we proceed in this direction, the place to implement
> > > > > those stats would be on the LogicalRepWorker struct, instead ?
> > > > >
> > > >
> > > > Or, we can make existing stats persistent and then add these stats
> > > > on top of it. Sawada-San, do you have any thoughts on this matter?
> > >
> > > I think that making existing stats including received_lsn and
> > > last_msg_receipt_time persistent by using stats collector could cause
> > > massive reporting messages. We can report these messages with a
> > > certain interval to reduce the amount of messages but we will end up
> > > seeing old stats on the view.
> > >
> >
> > Can't we keep the current and new stats both in-memory and persist on disk?
> > So, the persistent stats data will be used to fill the in-memory counters after
> > restarting of workers, otherwise, we will always refer to in-memory values.
> I felt this isn't impossible.
> When we have to update the values of the xact stats is
> the end of message apply for COMMIT, COMMIT PREPARED, STREAM_ABORT and etc
> or the time when an error happens during apply. Then, if we want,
> we can update xact stats values at such moments accordingly.
> I'm thinking that we will have a hash table whose key is a pair of subid + relid
> and entry is a proposed stats structure and update the entry,
> depending on the above timings.
>

Are you thinking of a separate hash table then what we are going to
create for Sawada-San's patch related to error stats? Isn't it
possible to have stats in the same hash table and same file?

> Here, one thing a bit unclear to me is
> whether we should move existing stats of pg_stat_subscription
> (such as last_lsn and reply_lsn) to the hash entry or not.
>

I think we should move it to hash entry. I think that is an
improvement over what we have now because now after restart those
stats gets lost.

-- 
With Regards,
Amit Kapila.



Re: Failed transaction statistics to measure the logical replication progress

От
Masahiko Sawada
Дата:
On Tue, Sep 28, 2021 at 7:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 28, 2021 at 11:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Sep 28, 2021 at 1:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > >
> > > > Then, if, we proceed in this direction,
> > > > the place to implement those stats
> > > > would be on the LogicalRepWorker struct, instead ?
> > > >
> > >
> > > Or, we can make existing stats persistent and then add these stats on
> > > top of it. Sawada-San, do you have any thoughts on this matter?
> >
> > I think that making existing stats including received_lsn and
> > last_msg_receipt_time persistent by using stats collector could cause
> > massive reporting messages. We can report these messages with a
> > certain interval to reduce the amount of messages but we will end up
> > seeing old stats on the view.
> >
>
> Can't we keep the current and new stats both in-memory and persist on
> disk? So, the persistent stats data will be used to fill the in-memory
> counters after restarting of workers, otherwise, we will always refer
> to in-memory values.

Interesting. Probably we can have apply workers and table sync workers
send their statistics to the stats collector at exit (before the stats
collector shutting down)? And the startup process will restore them at
restart?

>
> > Another idea could be to have a separate
> > view, say pg_stat_subscription_xact but I'm not sure it's a better
> > idea.
> >
>
> Yeah, that is another idea but I am afraid that having three different
> views for subscription stats will be too much.

Yeah, I have the same feeling.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Wed, Sep 29, 2021 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Sep 28, 2021 at 7:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Sep 28, 2021 at 11:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Tue, Sep 28, 2021 at 1:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > >
> > > > > Then, if, we proceed in this direction,
> > > > > the place to implement those stats
> > > > > would be on the LogicalRepWorker struct, instead ?
> > > > >
> > > >
> > > > Or, we can make existing stats persistent and then add these stats on
> > > > top of it. Sawada-San, do you have any thoughts on this matter?
> > >
> > > I think that making existing stats including received_lsn and
> > > last_msg_receipt_time persistent by using stats collector could cause
> > > massive reporting messages. We can report these messages with a
> > > certain interval to reduce the amount of messages but we will end up
> > > seeing old stats on the view.
> > >
> >
> > Can't we keep the current and new stats both in-memory and persist on
> > disk? So, the persistent stats data will be used to fill the in-memory
> > counters after restarting of workers, otherwise, we will always refer
> > to in-memory values.
>
> Interesting. Probably we can have apply workers and table sync workers
> send their statistics to the stats collector at exit (before the stats
> collector shutting down)? And the startup process will restore them at
> restart?
>

I think we do need to send at the exit but we should probably send at
some other regular interval as well to avoid losing all the stats
after the crash-restart situation.

-- 
With Regards,
Amit Kapila.



RE: Failed transaction statistics to measure the logical replication progress

От
Osumi, Takamichi/大墨 昂道
Дата:
On Wednesday, September 29, 2021 7:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Sep 29, 2021 at 11:35 AM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> > Thank you, Amit-san and Sawada-san for the discussion.
> > On Tuesday, September 28, 2021 7:05 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > > > Another idea could be to have a separate view, say
> > > > pg_stat_subscription_xact but I'm not sure it's a better idea.
> > > >
> > >
> > > Yeah, that is another idea but I am afraid that having three
> > > different views for subscription stats will be too much. I think it
> > > would be better if we can display these additional stats via the
> > > existing view pg_stat_subscription or the new view
> > > pg_stat_subscription_errors (or whatever name we want to give it).
> > pg_stat_subscription_errors specializes in showing an error record.
> > So, it would be awkward to combine it with other normal xact stats.
> >
> >
> > > > > > Then, if, we proceed in this direction, the place to implement
> > > > > > those stats would be on the LogicalRepWorker struct, instead ?
> > > > > >
> > > > >
> > > > > Or, we can make existing stats persistent and then add these
> > > > > stats on top of it. Sawada-San, do you have any thoughts on this
> matter?
> > > >
> > > > I think that making existing stats including received_lsn and
> > > > last_msg_receipt_time persistent by using stats collector could
> > > > cause massive reporting messages. We can report these messages
> > > > with a certain interval to reduce the amount of messages but we
> > > > will end up seeing old stats on the view.
> > > >
> > >
> > > Can't we keep the current and new stats both in-memory and persist on
> disk?
> > > So, the persistent stats data will be used to fill the in-memory
> > > counters after restarting of workers, otherwise, we will always refer to
> in-memory values.
> > I felt this isn't impossible.
> > When we have to update the values of the xact stats is the end of
> > message apply for COMMIT, COMMIT PREPARED, STREAM_ABORT and etc
> or the
> > time when an error happens during apply. Then, if we want, we can
> > update xact stats values at such moments accordingly.
> > I'm thinking that we will have a hash table whose key is a pair of
> > subid + relid and entry is a proposed stats structure and update the
> > entry, depending on the above timings.
> >
> 
> Are you thinking of a separate hash table then what we are going to create for
> Sawada-San's patch related to error stats? Isn't it possible to have stats in the
> same hash table and same file?
IIUC, this would be possible.

At the beginning, I thought we don't use stats collector at all for the xact stats
with the existing stats like received_lsn and last_msg_receipt_time,
considering the concern of too many reporting messages for them
to keep them always updated.  But, when we send
messages to stats collector only strictly limited times,
like successful exit and some regular interval as you explained in your other email,
this would disappear and I thought those new xact stats + moved stats can coexist in the hash
proposed by you in [1] and we can write those new stats in the same file.
Does everyone agree ?


> > Here, one thing a bit unclear to me is whether we should move existing
> > stats of pg_stat_subscription (such as last_lsn and reply_lsn) to the
> > hash entry or not.
> >
> 
> I think we should move it to hash entry. I think that is an improvement over what
> we have now because now after restart those stats gets lost.
Okay !


[1] - https://www.postgresql.org/message-id/CAA4eK1JRCQ-bYnbkwUrvcVcbLURjtiW%2BirFVvXzeG%2Bj%3Dy6jVgA%40mail.gmail.com


Best Regards,
    Takamichi Osumi



Re: Failed transaction statistics to measure the logical replication progress

От
Masahiko Sawada
Дата:
On Wed, Sep 29, 2021 at 8:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Sep 29, 2021 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Sep 28, 2021 at 7:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Sep 28, 2021 at 11:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Tue, Sep 28, 2021 at 1:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > >
> > > > > > Then, if, we proceed in this direction,
> > > > > > the place to implement those stats
> > > > > > would be on the LogicalRepWorker struct, instead ?
> > > > > >
> > > > >
> > > > > Or, we can make existing stats persistent and then add these stats on
> > > > > top of it. Sawada-San, do you have any thoughts on this matter?
> > > >
> > > > I think that making existing stats including received_lsn and
> > > > last_msg_receipt_time persistent by using stats collector could cause
> > > > massive reporting messages. We can report these messages with a
> > > > certain interval to reduce the amount of messages but we will end up
> > > > seeing old stats on the view.
> > > >
> > >
> > > Can't we keep the current and new stats both in-memory and persist on
> > > disk? So, the persistent stats data will be used to fill the in-memory
> > > counters after restarting of workers, otherwise, we will always refer
> > > to in-memory values.
> >
> > Interesting. Probably we can have apply workers and table sync workers
> > send their statistics to the stats collector at exit (before the stats
> > collector shutting down)? And the startup process will restore them at
> > restart?
> >
>
> I think we do need to send at the exit but we should probably send at
> some other regular interval as well to avoid losing all the stats
> after the crash-restart situation.

But we clear all statistics collected by the stats collector during
crash recovery. No?

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Failed transaction statistics to measure the logical replication progress

От
Hou, Zhijie/侯 志杰
Дата:
On Tues, Sep 28, 2021 6:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Sep 28, 2021 at 11:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Sep 28, 2021 at 1:54 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > >
> > > > Then, if, we proceed in this direction,
> > > > the place to implement those stats
> > > > would be on the LogicalRepWorker struct, instead ?
> > > >
> > >
> > > Or, we can make existing stats persistent and then add these stats on
> > > top of it. Sawada-San, do you have any thoughts on this matter?
> >
> > I think that making existing stats including received_lsn and
> > last_msg_receipt_time persistent by using stats collector could cause
> > massive reporting messages. We can report these messages with a
> > certain interval to reduce the amount of messages but we will end up
> > seeing old stats on the view.
> >
> 
> Can't we keep the current and new stats both in-memory and persist on
> disk? So, the persistent stats data will be used to fill the in-memory
> counters after restarting of workers, otherwise, we will always refer
> to in-memory values.

I think this approach works, but I have another concern about it.

The current pg_stat_subscription view is listed as "Dynamic Statistics Views" in
the document, the data in it seems about the worker process, and the view data
shows what the current worker did. But if we keep the new xact stat persist,
then it's not what the current worker did, it looks more related to the
subscription historic data.

Adding a new view seems resonalble, but it will bring another subscription
related view which might be too much. OTOH, I can see there are already some
different views[1] including xact stat, maybe adding another one is accepatble ?

[1]
pg_stat_xact_all_tables
pg_stat_xact_sys_tables
pg_stat_xact_user_tables
pg_stat_xact_user_functions

Best regards,
Hou zj

Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Thu, Sep 30, 2021 at 8:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Sep 29, 2021 at 8:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Sep 29, 2021 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Tue, Sep 28, 2021 at 7:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Tue, Sep 28, 2021 at 11:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > > On Tue, Sep 28, 2021 at 1:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > Then, if, we proceed in this direction,
> > > > > > > the place to implement those stats
> > > > > > > would be on the LogicalRepWorker struct, instead ?
> > > > > > >
> > > > > >
> > > > > > Or, we can make existing stats persistent and then add these stats on
> > > > > > top of it. Sawada-San, do you have any thoughts on this matter?
> > > > >
> > > > > I think that making existing stats including received_lsn and
> > > > > last_msg_receipt_time persistent by using stats collector could cause
> > > > > massive reporting messages. We can report these messages with a
> > > > > certain interval to reduce the amount of messages but we will end up
> > > > > seeing old stats on the view.
> > > > >
> > > >
> > > > Can't we keep the current and new stats both in-memory and persist on
> > > > disk? So, the persistent stats data will be used to fill the in-memory
> > > > counters after restarting of workers, otherwise, we will always refer
> > > > to in-memory values.
> > >
> > > Interesting. Probably we can have apply workers and table sync workers
> > > send their statistics to the stats collector at exit (before the stats
> > > collector shutting down)? And the startup process will restore them at
> > > restart?
> > >
> >
> > I think we do need to send at the exit but we should probably send at
> > some other regular interval as well to avoid losing all the stats
> > after the crash-restart situation.
>
> But we clear all statistics collected by the stats collector during
> crash recovery.
>

Right. So, maybe if we do what you are suggesting is sufficient.

-- 
With Regards,
Amit Kapila.



RE: Failed transaction statistics to measure the logical replication progress

От
Osumi, Takamichi/大墨 昂道
Дата:
On Thursday, September 30, 2021 11:53 AM Hou, Zhijie/侯 志杰 <houzj.fnst@fujitsu.com> wrote:
> On Tues, Sep 28, 2021 6:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Sep 28, 2021 at 11:35 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > >
> > > On Tue, Sep 28, 2021 at 1:54 PM Amit Kapila
> > > <amit.kapila16@gmail.com>
> > wrote:
> > > >
> > > > >
> > > > > Then, if, we proceed in this direction, the place to implement
> > > > > those stats would be on the LogicalRepWorker struct, instead ?
> > > > >
> > > >
> > > > Or, we can make existing stats persistent and then add these stats
> > > > on top of it. Sawada-San, do you have any thoughts on this matter?
> > >
> > > I think that making existing stats including received_lsn and
> > > last_msg_receipt_time persistent by using stats collector could
> > > cause massive reporting messages. We can report these messages with
> > > a certain interval to reduce the amount of messages but we will end
> > > up seeing old stats on the view.
> > >
> >
> > Can't we keep the current and new stats both in-memory and persist on
> > disk? So, the persistent stats data will be used to fill the in-memory
> > counters after restarting of workers, otherwise, we will always refer
> > to in-memory values.
> 
> I think this approach works, but I have another concern about it.
> 
> The current pg_stat_subscription view is listed as "Dynamic Statistics Views"
> in the document, the data in it seems about the worker process, and the view
> data shows what the current worker did. But if we keep the new xact stat
> persist, then it's not what the current worker did, it looks more related to the
> subscription historic data.
> 
> Adding a new view seems resonalble, but it will bring another subscription
> related view which might be too much. OTOH, I can see there are already some
> different views[1] including xact stat, maybe adding another one is
> accepatble ?
I think we'll try to suppress the increment of the numbers
of subscription related stats, if the possibility is not denied.

In terms of the document you mentioned,
I feel I'd need some modifications to it in the patch,
based on the change.


Best Regards,
    Takamichi Osumi



Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰 <houzj.fnst@fujitsu.com> wrote:
>
> On Tues, Sep 28, 2021 6:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Can't we keep the current and new stats both in-memory and persist on
> > disk? So, the persistent stats data will be used to fill the in-memory
> > counters after restarting of workers, otherwise, we will always refer
> > to in-memory values.
>
> I think this approach works, but I have another concern about it.
>
> The current pg_stat_subscription view is listed as "Dynamic Statistics Views" in
> the document, the data in it seems about the worker process, and the view data
> shows what the current worker did. But if we keep the new xact stat persist,
> then it's not what the current worker did, it looks more related to the
> subscription historic data.
>

I see your point.

> Adding a new view seems resonalble, but it will bring another subscription
> related view which might be too much. OTOH, I can see there are already some
> different views[1] including xact stat, maybe adding another one is accepatble ?
>

These all views are related to untransmitted to the collector but what
we really need is a view similar to pg_stat_archiver or
pg_stat_bgwriter which gives information about background workers.
Now, the problem as I see is if we go that route then
pg_stat_subscription will no longer remain dynamic view and one might
consider that as a compatibility break. The other idea I have shared
is that we display these stats under the new view introduced by
Sawada-San's patch [1] and probably rename that view as
pg_stat_subscription_worker where all the stats (xact info and last
failure information) about each worker will be displayed. Do you have
any opinion on that idea or do you see any problem with it?

Sure, we can introduce a new view but I want to avoid it if possible.

[1] - https://www.postgresql.org/message-id/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK%3D30xJfUVihNZDA%40mail.gmail.com

--
With Regards,
Amit Kapila.



RE: Failed transaction statistics to measure the logical replication progress

От
Osumi, Takamichi/大墨 昂道
Дата:
On Thursday, September 30, 2021 1:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tues, Sep 28, 2021 6:05 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > Adding a new view seems resonalble, but it will bring another
> > subscription related view which might be too much. OTOH, I can see
> > there are already some different views[1] including xact stat, maybe adding
> another one is accepatble ?
> These all views are related to untransmitted to the collector but what we really
> need is a view similar to pg_stat_archiver or pg_stat_bgwriter which gives
> information about background workers.
> Now, the problem as I see is if we go that route then pg_stat_subscription will
> no longer remain dynamic view and one might consider that as a compatibility
> break. The other idea I have shared is that we display these stats under the new
> view introduced by Sawada-San's patch [1] and probably rename that view as
> pg_stat_subscription_worker where all the stats (xact info and last failure
> information) about each worker will be displayed.
Sorry, all the stats in pg_stat_subscription_worker view ?

There was a discussion that
the xact info should be displayed from pg_stat_subscription
with existing stats in the same (which will be changed to persist),
but when your above proposal comes true,
the list of pg_stat_subscription_worker's columns
will be something like below (when I list up major columns).

- subid, subrelid and some other relation attributes required
- 5 stats values moved from pg_stat_subscription
  received_lsn, last_msg_send_time, last_msg_receipt_time,
  latest_end_lsn, latest_end_time
- xact stats
  xact_commit, xact_commit_bytes,
  xact_error, xact_error_bytes,
  xact_abort, xact_abort_bytes,
- error stats
  datname, command, xid, failure_source,
  failure_count, last_failure, last_failure_message, etc

If this is what you imagined,
what we can do from left values of pg_stat_subscription
would be only finding subscription worker's process id.
Is it OK and is this view's alignment acceptable ?


Best Regards,
    Takamichi Osumi


Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Thu, Sep 30, 2021 at 1:02 PM Osumi, Takamichi/大墨 昂道
<osumi.takamichi@fujitsu.com> wrote:
>
> On Thursday, September 30, 2021 1:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tues, Sep 28, 2021 6:05 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > > Adding a new view seems resonalble, but it will bring another
> > > subscription related view which might be too much. OTOH, I can see
> > > there are already some different views[1] including xact stat, maybe adding
> > another one is accepatble ?
> > These all views are related to untransmitted to the collector but what we really
> > need is a view similar to pg_stat_archiver or pg_stat_bgwriter which gives
> > information about background workers.
> > Now, the problem as I see is if we go that route then pg_stat_subscription will
> > no longer remain dynamic view and one might consider that as a compatibility
> > break. The other idea I have shared is that we display these stats under the new
> > view introduced by Sawada-San's patch [1] and probably rename that view as
> > pg_stat_subscription_worker where all the stats (xact info and last failure
> > information) about each worker will be displayed.
> Sorry, all the stats in pg_stat_subscription_worker view ?
>
> There was a discussion that
> the xact info should be displayed from pg_stat_subscription
> with existing stats in the same (which will be changed to persist),
> but when your above proposal comes true,
> the list of pg_stat_subscription_worker's columns
> will be something like below (when I list up major columns).
>
> - subid, subrelid and some other relation attributes required
> - 5 stats values moved from pg_stat_subscription
>   received_lsn, last_msg_send_time, last_msg_receipt_time,
>   latest_end_lsn, latest_end_time
>

If we go with the view as pg_stat_subscription_worker, we don't need
to move the existing stats of pg_stat_subscription into it. There is a
clear difference between them, all the stats displayed via
pg_stat_subscription are dynamic and won't persist whereas all the
stats corresponding to pg_stat_subscription_worker view will persist.


--
With Regards,
Amit Kapila.



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Thursday, September 30, 2021 8:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Sep 30, 2021 at 1:02 PM Osumi, Takamichi/大墨 昂道
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Thursday, September 30, 2021 1:15 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > > On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > On Tues, Sep 28, 2021 6:05 PM Amit Kapila
> > > > <amit.kapila16@gmail.com>
> > > wrote:
> > > > Adding a new view seems resonalble, but it will bring another
> > > > subscription related view which might be too much. OTOH, I can see
> > > > there are already some different views[1] including xact stat,
> > > > maybe adding
> > > another one is accepatble ?
> > > These all views are related to untransmitted to the collector but
> > > what we really need is a view similar to pg_stat_archiver or
> > > pg_stat_bgwriter which gives information about background workers.
> > > Now, the problem as I see is if we go that route then
> > > pg_stat_subscription will no longer remain dynamic view and one
> > > might consider that as a compatibility break. The other idea I have
> > > shared is that we display these stats under the new view introduced
> > > by Sawada-San's patch [1] and probably rename that view as
> > > pg_stat_subscription_worker where all the stats (xact info and last
> > > failure
> > > information) about each worker will be displayed.
> > Sorry, all the stats in pg_stat_subscription_worker view ?
> >
> > There was a discussion that
> > the xact info should be displayed from pg_stat_subscription with
> > existing stats in the same (which will be changed to persist), but
> > when your above proposal comes true, the list of
> > pg_stat_subscription_worker's columns will be something like below
> > (when I list up major columns).
> >
> > - subid, subrelid and some other relation attributes required
> > - 5 stats values moved from pg_stat_subscription
> >   received_lsn, last_msg_send_time, last_msg_receipt_time,
> >   latest_end_lsn, latest_end_time
> >
> 
> If we go with the view as pg_stat_subscription_worker, we don't need to move
> the existing stats of pg_stat_subscription into it.
Thank you for clarifying this point. I understand.


Best Regards,
    Takamichi Osumi


RE: Failed transaction statistics to measure the logical replication progress

От
"houzj.fnst@fujitsu.com"
Дата:
On Thursday, September 30, 2021 12:15 PM Amit Kapila <amit.kapila16@gmail.com>
> On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰 <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tues, Sep 28, 2021 6:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Can't we keep the current and new stats both in-memory and persist on
> > > disk? So, the persistent stats data will be used to fill the in-memory
> > > counters after restarting of workers, otherwise, we will always refer
> > > to in-memory values.
> >
> > I think this approach works, but I have another concern about it.
> >
> > The current pg_stat_subscription view is listed as "Dynamic Statistics Views"
> in
> > the document, the data in it seems about the worker process, and the view
> data
> > shows what the current worker did. But if we keep the new xact stat persist,
> > then it's not what the current worker did, it looks more related to the
> > subscription historic data.
> >
> 
> I see your point.
> 
> > Adding a new view seems resonalble, but it will bring another subscription
> > related view which might be too much. OTOH, I can see there are already
> some
> > different views[1] including xact stat, maybe adding another one is
> accepatble ?
> >
> 
> These all views are related to untransmitted to the collector but what
> we really need is a view similar to pg_stat_archiver or
> pg_stat_bgwriter which gives information about background workers.
> Now, the problem as I see is if we go that route then
> pg_stat_subscription will no longer remain dynamic view and one might
> consider that as a compatibility break. The other idea I have shared
> is that we display these stats under the new view introduced by
> Sawada-San's patch [1] and probably rename that view as
> pg_stat_subscription_worker where all the stats (xact info and last
> failure information) about each worker will be displayed. Do you have
> any opinion on that idea or do you see any problem with it?

Personally, I think it seems reasonable to merge the xact stat into the view from
sawada-san's patch.

One problem I noticed is that pg_stat_subscription_error
currently have a 'count' column which show how many times the last error
happened. The xact stat here also have a similar value 'xact_error'. I think we
might need to rename it or merge them into one in some way.

Besides, if we decide to merge xact stat into pg_stat_subscription_error, some column
seems need to be renamed. Maybe like:
error_message => Last_error_message, command=> last_error_command..

Best regards,
Hou zj

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Thursday, October 14, 2021 12:54 PM Hou, Zhijie/侯 志杰 <houzj.fnst@fujitsu.com> wrote:
> On Thursday, September 30, 2021 12:15 PM Amit Kapila
> <amit.kapila16@gmail.com>
> > On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰
> <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tues, Sep 28, 2021 6:05 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > > >
> > > > Can't we keep the current and new stats both in-memory and persist
> > > > on disk? So, the persistent stats data will be used to fill the
> > > > in-memory counters after restarting of workers, otherwise, we will
> > > > always refer to in-memory values.
> > >
> > > I think this approach works, but I have another concern about it.
> > >
> > > The current pg_stat_subscription view is listed as "Dynamic Statistics
> Views"
> > in
> > > the document, the data in it seems about the worker process, and the
> > > view
> > data
> > > shows what the current worker did. But if we keep the new xact stat
> > > persist, then it's not what the current worker did, it looks more
> > > related to the subscription historic data.
> > >
> >
> > I see your point.
> >
> > > Adding a new view seems resonalble, but it will bring another
> > > subscription related view which might be too much. OTOH, I can see
> > > there are already
> > some
> > > different views[1] including xact stat, maybe adding another one is
> > accepatble ?
> > >
> >
> > These all views are related to untransmitted to the collector but what
> > we really need is a view similar to pg_stat_archiver or
> > pg_stat_bgwriter which gives information about background workers.
> > Now, the problem as I see is if we go that route then
> > pg_stat_subscription will no longer remain dynamic view and one might
> > consider that as a compatibility break. The other idea I have shared
> > is that we display these stats under the new view introduced by
> > Sawada-San's patch [1] and probably rename that view as
> > pg_stat_subscription_worker where all the stats (xact info and last
> > failure information) about each worker will be displayed. Do you have
> > any opinion on that idea or do you see any problem with it?
> 
> Personally, I think it seems reasonable to merge the xact stat into the view from
> sawada-san's patch.
> 
> One problem I noticed is that pg_stat_subscription_error currently have a
> 'count' column which show how many times the last error happened. The xact
> stat here also have a similar value 'xact_error'. I think we might need to rename
> it or merge them into one in some way.
> 
> Besides, if we decide to merge xact stat into pg_stat_subscription_error, some
> column seems need to be renamed. Maybe like:
> error_message => Last_error_message, command=> last_error_command..
Yeah, we must make them distinguished clearly.

I guessed that you are concerned about
amount of renaming codes that could be a bit large
or you come up with a necessity to consider
the all column names of the pg_stat_subscription_worker
together all at once in advance.

It's because my instant impression is,
when we go with the current xact stats column definitions
(xact_commit, xact_commit_bytes, xact_error, xact_error_bytes,
xact_abort, xact_abort_bytes), the renaming problem can be solved
if I write one additional patch or extend the main patch of
xact stats to handle renaming.
(This can work to keep both threads independent
from each other).

Did you have some concern that cannot be handled by this way ?


Best Regards,
    Takamichi Osumi


RE: Failed transaction statistics to measure the logical replication progress

От
"houzj.fnst@fujitsu.com"
Дата:
On Thursday, October 14, 2021 2:13 PM Osumi, Takamichi wrote:
> On Thursday, October 14, 2021 12:54 PM Hou, Zhijie<houzj.fnst@fujitsu.com> wrote:
> > On Thursday, September 30, 2021 12:15 PM Amit Kapila
> > > On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰wrote:
> > > >
> > > > On Tues, Sep 28, 2021 6:05 PM Amit Kapila
> > > > <amit.kapila16@gmail.com>
> > wrote:
> > > > >
> > > > > Can't we keep the current and new stats both in-memory and
> > > > > persist on disk? So, the persistent stats data will be used to
> > > > > fill the in-memory counters after restarting of workers,
> > > > > otherwise, we will always refer to in-memory values.
> > > >
> > > > I think this approach works, but I have another concern about it.
> > > >
> > > > The current pg_stat_subscription view is listed as "Dynamic
> > > > Statistics Views" in
> > > > the document, the data in it seems about the worker process, and
> > > > the view data
> > > > shows what the current worker did. But if we keep the new xact
> > > > stat persist, then it's not what the current worker did, it looks
> > > > more related to the subscription historic data.
> > > >
> > >
> > > I see your point.
> > >
> > > > Adding a new view seems resonalble, but it will bring another
> > > > subscription related view which might be too much. OTOH, I can see
> > > > there are already some
> > > > different views[1] including xact stat, maybe adding another one
> > > > is accepatble ?
> > > >
> > >
> > > These all views are related to untransmitted to the collector but
> > > what we really need is a view similar to pg_stat_archiver or
> > > pg_stat_bgwriter which gives information about background workers.
> > > Now, the problem as I see is if we go that route then
> > > pg_stat_subscription will no longer remain dynamic view and one
> > > might consider that as a compatibility break. The other idea I have
> > > shared is that we display these stats under the new view introduced
> > > by Sawada-San's patch [1] and probably rename that view as
> > > pg_stat_subscription_worker where all the stats (xact info and last
> > > failure information) about each worker will be displayed. Do you
> > > have any opinion on that idea or do you see any problem with it?
> >
> > Personally, I think it seems reasonable to merge the xact stat into
> > the view from sawada-san's patch.
> >
> > One problem I noticed is that pg_stat_subscription_error currently
> > have a 'count' column which show how many times the last error
> > happened. The xact stat here also have a similar value 'xact_error'. I
> > think we might need to rename it or merge them into one in some way.
> >
> > Besides, if we decide to merge xact stat into
> > pg_stat_subscription_error, some column seems need to be renamed.
> Maybe like:
> > error_message => Last_error_message, command=> last_error_command..
> Yeah, we must make them distinguished clearly.
> 
> I guessed that you are concerned about
> amount of renaming codes that could be a bit large or you come up with a
> necessity to consider the all column names of the pg_stat_subscription_worker
> together all at once in advance.
> 
> It's because my instant impression is,
> when we go with the current xact stats column definitions (xact_commit,
> xact_commit_bytes, xact_error, xact_error_bytes, xact_abort,
> xact_abort_bytes), the renaming problem can be solved if I write one
> additional patch or extend the main patch of xact stats to handle renaming.
> (This can work to keep both threads independent from each other).
> 
> Did you have some concern that cannot be handled by this way ?
Hi,

Currently, I don't find some unsolvable issues in this approach.

Best regards,
Hou zj


Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Thu, Oct 14, 2021 at 9:23 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Thursday, September 30, 2021 12:15 PM Amit Kapila <amit.kapila16@gmail.com>
> >
> > These all views are related to untransmitted to the collector but what
> > we really need is a view similar to pg_stat_archiver or
> > pg_stat_bgwriter which gives information about background workers.
> > Now, the problem as I see is if we go that route then
> > pg_stat_subscription will no longer remain dynamic view and one might
> > consider that as a compatibility break. The other idea I have shared
> > is that we display these stats under the new view introduced by
> > Sawada-San's patch [1] and probably rename that view as
> > pg_stat_subscription_worker where all the stats (xact info and last
> > failure information) about each worker will be displayed. Do you have
> > any opinion on that idea or do you see any problem with it?
>
> Personally, I think it seems reasonable to merge the xact stat into the view from
> sawada-san's patch.
>
> One problem I noticed is that pg_stat_subscription_error
> currently have a 'count' column which show how many times the last error
> happened. The xact stat here also have a similar value 'xact_error'. I think we
> might need to rename it or merge them into one in some way.
>
> Besides, if we decide to merge xact stat into pg_stat_subscription_error, some column
> seems need to be renamed. Maybe like:
> error_message => Last_error_message, command=> last_error_command..
>

Don't you think that keeping the view name as
pg_stat_subscription_error would be a bit confusing if it has to
display xact_info? Isn't it better to change it to
pg_stat_subscription_worker or some other worker-specific generic
name?

-- 
With Regards,
Amit Kapila.



RE: Failed transaction statistics to measure the logical replication progress

От
"houzj.fnst@fujitsu.com"
Дата:
On Monday, October 18, 2021 6:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Oct 14, 2021 at 9:23 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Thursday, September 30, 2021 12:15 PM Amit Kapila
> > <amit.kapila16@gmail.com>
> > >
> > > These all views are related to untransmitted to the collector but
> > > what we really need is a view similar to pg_stat_archiver or
> > > pg_stat_bgwriter which gives information about background workers.
> > > Now, the problem as I see is if we go that route then
> > > pg_stat_subscription will no longer remain dynamic view and one
> > > might consider that as a compatibility break. The other idea I have
> > > shared is that we display these stats under the new view introduced
> > > by Sawada-San's patch [1] and probably rename that view as
> > > pg_stat_subscription_worker where all the stats (xact info and last
> > > failure information) about each worker will be displayed. Do you
> > > have any opinion on that idea or do you see any problem with it?
> >
> > Personally, I think it seems reasonable to merge the xact stat into
> > the view from sawada-san's patch.
> >
> > One problem I noticed is that pg_stat_subscription_error currently
> > have a 'count' column which show how many times the last error
> > happened. The xact stat here also have a similar value 'xact_error'. I
> > think we might need to rename it or merge them into one in some way.
> >
> > Besides, if we decide to merge xact stat into
> > pg_stat_subscription_error, some column seems need to be renamed.
> Maybe like:
> > error_message => Last_error_message, command=> last_error_command..
> >
> 
> Don't you think that keeping the view name as pg_stat_subscription_error
> would be a bit confusing if it has to display xact_info? Isn't it better to change it
> to pg_stat_subscription_worker or some other worker-specific generic name?

Yes, I agreed that rename the view to pg_stat_subscription_worker or some
other worker-specific generic name is better if we decide to move forward
with this approach.

Best regards,
Hou zj

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Monday, October 18, 2021 11:52 AM Hou, Zhijie/侯 志杰 <houzj.fnst@fujitsu.com> wrote:
> On Thursday, October 14, 2021 2:13 PM Osumi, Takamichi wrote:
> > On Thursday, October 14, 2021 12:54 PM Hou,
> Zhijie<houzj.fnst@fujitsu.com> wrote:
> > > On Thursday, September 30, 2021 12:15 PM Amit Kapila
> > > > On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰wrote:
> > > > >
> > > > > On Tues, Sep 28, 2021 6:05 PM Amit Kapila
> > > > > <amit.kapila16@gmail.com>
> > > wrote:
> > > > > >
> > > > > > Can't we keep the current and new stats both in-memory and
> > > > > > persist on disk? So, the persistent stats data will be used to
> > > > > > fill the in-memory counters after restarting of workers,
> > > > > > otherwise, we will always refer to in-memory values.
> > > > >
> > > > > I think this approach works, but I have another concern about it.
> > > > >
> > > > > The current pg_stat_subscription view is listed as "Dynamic
> > > > > Statistics Views" in the document, the data in it seems about
> > > > > the worker process, and the view data shows what the current
> > > > > worker did. But if we keep the new xact stat persist, then it's
> > > > > not what the current worker did, it looks more related to the
> > > > > subscription historic data.
> > > > >
> > > >
> > > > I see your point.
> > > >
> > > > > Adding a new view seems resonalble, but it will bring another
> > > > > subscription related view which might be too much. OTOH, I can
> > > > > see there are already some different views[1] including xact
> > > > > stat, maybe adding another one is accepatble ?
> > > > >
> > > >
> > > > These all views are related to untransmitted to the collector but
> > > > what we really need is a view similar to pg_stat_archiver or
> > > > pg_stat_bgwriter which gives information about background workers.
> > > > Now, the problem as I see is if we go that route then
> > > > pg_stat_subscription will no longer remain dynamic view and one
> > > > might consider that as a compatibility break. The other idea I
> > > > have shared is that we display these stats under the new view
> > > > introduced by Sawada-San's patch [1] and probably rename that view
> > > > as pg_stat_subscription_worker where all the stats (xact info and
> > > > last failure information) about each worker will be displayed. Do
> > > > you have any opinion on that idea or do you see any problem with it?
> > >
> > > Personally, I think it seems reasonable to merge the xact stat into
> > > the view from sawada-san's patch.
> > >
> > > One problem I noticed is that pg_stat_subscription_error currently
> > > have a 'count' column which show how many times the last error
> > > happened. The xact stat here also have a similar value 'xact_error'.
> > > I think we might need to rename it or merge them into one in some way.
> > >
> > > Besides, if we decide to merge xact stat into
> > > pg_stat_subscription_error, some column seems need to be renamed.
> > Maybe like:
> > > error_message => Last_error_message, command=>
> last_error_command..
> > Yeah, we must make them distinguished clearly.
> >
> > I guessed that you are concerned about amount of renaming codes that
> > could be a bit large or you come up with a necessity to consider the
> > all column names of the pg_stat_subscription_worker together all at
> > once in advance.
> >
> > It's because my instant impression is, when we go with the current
> > xact stats column definitions (xact_commit, xact_commit_bytes,
> > xact_error, xact_error_bytes, xact_abort, xact_abort_bytes), the
> > renaming problem can be solved if I write one additional patch or
> > extend the main patch of xact stats to handle renaming.
> > (This can work to keep both threads independent from each other).
> >
> > Did you have some concern that cannot be handled by this way ?
> Hi,
> 
> Currently, I don't find some unsolvable issues in this approach.
Okay. Glad to hear that.
Then, I can restart my implementation with this direction.


Best Regards,
    Takamichi Osumi

Re: Failed transaction statistics to measure the logical replication progress

От
Masahiko Sawada
Дата:
On Mon, Oct 18, 2021 at 7:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Oct 14, 2021 at 9:23 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Thursday, September 30, 2021 12:15 PM Amit Kapila <amit.kapila16@gmail.com>
> > >
> > > These all views are related to untransmitted to the collector but what
> > > we really need is a view similar to pg_stat_archiver or
> > > pg_stat_bgwriter which gives information about background workers.
> > > Now, the problem as I see is if we go that route then
> > > pg_stat_subscription will no longer remain dynamic view and one might
> > > consider that as a compatibility break. The other idea I have shared
> > > is that we display these stats under the new view introduced by
> > > Sawada-San's patch [1] and probably rename that view as
> > > pg_stat_subscription_worker where all the stats (xact info and last
> > > failure information) about each worker will be displayed. Do you have
> > > any opinion on that idea or do you see any problem with it?
> >
> > Personally, I think it seems reasonable to merge the xact stat into the view from
> > sawada-san's patch.
> >
> > One problem I noticed is that pg_stat_subscription_error
> > currently have a 'count' column which show how many times the last error
> > happened. The xact stat here also have a similar value 'xact_error'. I think we
> > might need to rename it or merge them into one in some way.
> >
> > Besides, if we decide to merge xact stat into pg_stat_subscription_error, some column
> > seems need to be renamed. Maybe like:
> > error_message => Last_error_message, command=> last_error_command..
> >
>
> Don't you think that keeping the view name as
> pg_stat_subscription_error would be a bit confusing if it has to
> display xact_info? Isn't it better to change it to
> pg_stat_subscription_worker or some other worker-specific generic
> name?

I agree that it'd be better to include xact info to
pg_stat_subscription_errors view rather than making
pg_stat_subscription a cumulative view. It would be more simple and
consistent.

The user might want to reset only either error stats or xact stats but
we can do that by passing a value to the reset function. For example,
pg_stat_reset_subscription_worker(10, 20, ‘error') for resetting only
error stats whereas pg_stat_reset_subscription_worker(10, 20, ‘xact’)
for resetting only xact stats etc (passing NULL or omitting the third
argument means resetting all stats). I'll change the view name in the
next version patch.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Thu, Oct 21, 2021 at 6:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Oct 18, 2021 at 7:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Oct 14, 2021 at 9:23 AM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Thursday, September 30, 2021 12:15 PM Amit Kapila <amit.kapila16@gmail.com>
> > > >
> > > > These all views are related to untransmitted to the collector but what
> > > > we really need is a view similar to pg_stat_archiver or
> > > > pg_stat_bgwriter which gives information about background workers.
> > > > Now, the problem as I see is if we go that route then
> > > > pg_stat_subscription will no longer remain dynamic view and one might
> > > > consider that as a compatibility break. The other idea I have shared
> > > > is that we display these stats under the new view introduced by
> > > > Sawada-San's patch [1] and probably rename that view as
> > > > pg_stat_subscription_worker where all the stats (xact info and last
> > > > failure information) about each worker will be displayed. Do you have
> > > > any opinion on that idea or do you see any problem with it?
> > >
> > > Personally, I think it seems reasonable to merge the xact stat into the view from
> > > sawada-san's patch.
> > >
> > > One problem I noticed is that pg_stat_subscription_error
> > > currently have a 'count' column which show how many times the last error
> > > happened. The xact stat here also have a similar value 'xact_error'. I think we
> > > might need to rename it or merge them into one in some way.
> > >
> > > Besides, if we decide to merge xact stat into pg_stat_subscription_error, some column
> > > seems need to be renamed. Maybe like:
> > > error_message => Last_error_message, command=> last_error_command..
> > >
> >
> > Don't you think that keeping the view name as
> > pg_stat_subscription_error would be a bit confusing if it has to
> > display xact_info? Isn't it better to change it to
> > pg_stat_subscription_worker or some other worker-specific generic
> > name?
>
> I agree that it'd be better to include xact info to
> pg_stat_subscription_errors view rather than making
> pg_stat_subscription a cumulative view. It would be more simple and
> consistent.
>
> The user might want to reset only either error stats or xact stats but
> we can do that by passing a value to the reset function. For example,
> pg_stat_reset_subscription_worker(10, 20, ‘error') for resetting only
> error stats whereas pg_stat_reset_subscription_worker(10, 20, ‘xact’)
> for resetting only xact stats etc (passing NULL or omitting the third
> argument means resetting all stats).
>

Sounds reasonable.

> I'll change the view name in the
> next version patch.
>

Thanks, that will be helpful.

--
With Regards,
Amit Kapila.



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Thursday, October 21, 2021 10:19 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Oct 18, 2021 at 7:03 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Thu, Oct 14, 2021 at 9:23 AM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Thursday, September 30, 2021 12:15 PM Amit Kapila
> > > <amit.kapila16@gmail.com>
> > > >
> > > > These all views are related to untransmitted to the collector but
> > > > what we really need is a view similar to pg_stat_archiver or
> > > > pg_stat_bgwriter which gives information about background workers.
> > > > Now, the problem as I see is if we go that route then
> > > > pg_stat_subscription will no longer remain dynamic view and one
> > > > might consider that as a compatibility break. The other idea I
> > > > have shared is that we display these stats under the new view
> > > > introduced by Sawada-San's patch [1] and probably rename that view
> > > > as pg_stat_subscription_worker where all the stats (xact info and
> > > > last failure information) about each worker will be displayed. Do
> > > > you have any opinion on that idea or do you see any problem with it?
> > >
> > > Personally, I think it seems reasonable to merge the xact stat into
> > > the view from sawada-san's patch.
> > >
> > > One problem I noticed is that pg_stat_subscription_error currently
> > > have a 'count' column which show how many times the last error
> > > happened. The xact stat here also have a similar value 'xact_error'.
> > > I think we might need to rename it or merge them into one in some way.
> > >
> > > Besides, if we decide to merge xact stat into
> > > pg_stat_subscription_error, some column seems need to be renamed.
> Maybe like:
> > > error_message => Last_error_message, command=>
> last_error_command..
> > >
> >
> > Don't you think that keeping the view name as
> > pg_stat_subscription_error would be a bit confusing if it has to
> > display xact_info? Isn't it better to change it to
> > pg_stat_subscription_worker or some other worker-specific generic
> > name?
> 
> I agree that it'd be better to include xact info to pg_stat_subscription_errors
> view rather than making pg_stat_subscription a cumulative view. It would be
> more simple and consistent.
... 
>I'll change the view name in the next version patch.
Thanks a lot, Sawasa-san.

I've created a new patch that extends pg_stat_subscription_workers
to include other transaction statistics.

Note that this patch depends on v18 patch-set in [1]
and needs to be after the perl modules' namespace changes
conducted recently by commit b3b4d8e68ae83f432f43f035c7eb481ef93e1583.

There are other several major changes compared to the previous version.

(1)
Addressing several streaming transactions running in parallel on the pub that
can send unexpected order of partial data demarcated stream start and stream stop.
Even if one of them aborts, now bytes calculation should be correct.

(2)
Updates of stats on the view at either commit prepared or rollback prepared time.
This means we don't lost prepared transaction size even after server restart
and user can see the results of two phase operation at those timings.

(3)
Addition of TAP tests.

(4)
Renames of other existing columns so that those meanings are more apparent
in align with other new stats.

Some important details are written in the comments of the attached patch.

[1] - https://www.postgresql.org/message-id/CAD21AoCTtQgfy57AxB4q8KUOpRH8rkHN%3Ds_9p9Pvno_XoBK5wg%40mail.gmail.com

Best Regards,
    Takamichi Osumi


Вложения

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Thursday, October 28, 2021 11:19 PM I wrote:
> I've created a new patch that extends pg_stat_subscription_workers to include
> other transaction statistics.
> 
> Note that this patch depends on v18 patch-set in [1]...
Rebased based on the v19 in [1].
Also, fixed documentation a little and made tests tidy.
FYI, the newly included TAP test(027_worker_xact_stats.pl) is stable
because I checked that 100 times of its execution in a tight loop all passed.


[1] - https://www.postgresql.org/message-id/CAD21AoDY-9_x819F_m1_wfCVXXFJrGiSmR2MfC9Nw4nW8Om0qA%40mail.gmail.com


Best Regards,
    Takamichi Osumi


Вложения

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Monday, November 1, 2021 10:18 PM I wrote:
> On Thursday, October 28, 2021 11:19 PM I wrote:
> > I've created a new patch that extends pg_stat_subscription_workers to
> > include other transaction statistics.
> >
> > Note that this patch depends on v18 patch-set in [1]...
> Rebased based on the v19 in [1].
I'd like to add one more explanation about the latest patches for review.


On Thursday, October 28, 2021 11:19 PM I wrote:
> (2)
> Updates of stats on the view at either commit prepared or rollback prepared
> time.
> This means we don't lost prepared transaction size even after server restart
> and user can see the results of two phase operation at those timings.
There is another reason I had to treat 'prepare' message like above.
Any reviewer might think that sending prepared bytes to stats collector
and making the bytes survive over the restart is too much.

But, we don't know if the prepared transaction is processed
by commit prepared or rollback prepared at 'prepare' time.
An execution of commit prepared needs to update the column of xact_commit and
xact_commit_bytes while rollback prepared does the column of xact_abort
and xact_abort_bytes where this patch is introducing.

Therefore, even when we can calculate the bytes of prepared txn at prepare time,
it's *not* possible to add the transaction bytes to either stats column of
bytes sizes and clean up the bytes. Then, there was a premise
that we should not lose the prepared txn bytes by the shutdown
so my patch has become the implementation described above.


Best Regards,
    Takamichi Osumi


Re: Failed transaction statistics to measure the logical replication progress

От
Greg Nancarrow
Дата:
On Tue, Nov 2, 2021 at 12:18 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Thursday, October 28, 2021 11:19 PM I wrote:
> > I've created a new patch that extends pg_stat_subscription_workers to include
> > other transaction statistics.
> >
> > Note that this patch depends on v18 patch-set in [1]...
> Rebased based on the v19 in [1].
> Also, fixed documentation a little and made tests tidy.
> FYI, the newly included TAP test(027_worker_xact_stats.pl) is stable
> because I checked that 100 times of its execution in a tight loop all passed.
>

I have done some basic testing of the patch and have some initial
review comments:

(1) I think this patch needs to be in "git format-patch" format, with
a proper commit message that describes the purpose of the patch and
the functionality it adds, and any high-level design points (something
like the overview given in the initial post, and accounting for the
subsequent discussion points and updated functionality).

(2) doc/src/sgml/monitoring.sgml
There are some grammatical issues in the current description. I
suggest changing it to something like:

BEFORE:
+      <entry>At least one row per subscription, showing about
transaction statistics and error summary that
AFTER:
+      <entry>At least one row per subscription, showing transaction
statistics and information about errors that

(2) doc/src/sgml/monitoring.sgml
The current description seems a little confusing.
Per subscription, it shows the transaction statistics and any last
error info from tablesync/apply workers? If this is the case, I'd
suggest the following change:

BEFORE:
+   one row per subscription for transaction statistics and summary of the last
+   error reported by workers applying logical replication changes and workers
+   handling the initial data copy of the subscribed tables.
AFTER:
+   one row per subscription, showing corresponding transaction statistics and
+   information about the last error reported by workers applying
logical replication
+   changes or by workers handling the initial data copy of the
subscribed tables.

(3) xact_commit
I think that the "xact_commit" column should be named
"xact_commit_count" or "xact_commits".
Similarly, I think "xact_error" should be named "xact_error_count" or
"xact_errors", and "xact_aborts" should be named "xact_abort_count" or
"xact_aborts".

(4) xact_commit_bytes

+       Amount of transactions data successfully applied in this subscription.
+       Consumed memory for xact_commit is displayed.

I find this description a bit confusing. "Consumed memory for
xact_commit" seems different to "transactions data".
Could the description be something like: Amount of data (in bytes)
successfully applied in this subscription, across "xact_commit_count"
transactions.

(5)
I'd suggest some minor rewording for the following:

BEFORE:
+       Number of transactions failed to be applied and caught by table
+       sync worker or main apply worker in this subscription.
AFTER:
+       Number of transactions that failed to be applied by the table
+       sync worker or main apply worker in this subscription.

(6) xact_error_bytes
Again, it's a little confusing referring to "consumed memory" here.
How about rewording this, something like:

BEFORE:
+       Amount of transactions data unsuccessfully applied in this subscription.
+       Consumed memory that past failed transaction used is displayed.
AFTER:
+       Amount of data (in bytes) unsuccessfully applied in this
subscription by the last failed transaction.

(7)
The additional information provided for "xact_abort_bytes" needs some
rewording, something like:

BEFORE:
+       Increase <literal>logical_decoding_work_mem</literal> on the publisher
+       so that it exceeds the size of whole streamed transaction
+       to suppress unnecessary consumed network bandwidth in addition to change
+       in memory of the subscriber, if unexpected amount of streamed
transactions
+       are aborted.
AFTER:
+       In order to suppress unnecessary consumed network bandwidth, increase
+       <literal>logical_decoding_work_mem</literal> on the publisher so that it
+       exceeds the size of the whole streamed transaction, and
additionally increase
+       the available subscriber memory, if an unexpected amount of
streamed transactions
+       are aborted.

(8)
Suggested update:

BEFORE:
+ *  Tell the collector that worker transaction has finished without problem.
AFTER:
+ *  Tell the collector that the worker transaction has successfully completed.

(9) src/backend/postmaster/pgstat.c
I think that the GID copying is unnecessarily copying the whole GID
buffer or using an additional strlen().
It should be changed to use strlcpy() to match other code:

BEFORE:
+ /* get the gid for this two phase operation */
+ if (command == LOGICAL_REP_MSG_PREPARE ||
+   command == LOGICAL_REP_MSG_STREAM_PREPARE)
+   memcpy(msg.m_gid, prepare_data->gid, GIDSIZE);
+ else if (command == LOGICAL_REP_MSG_COMMIT_PREPARED)
+   memcpy(msg.m_gid, commit_data->gid, GIDSIZE);
+ else /* rollback prepared */
+   memcpy(msg.m_gid, rollback_data->gid, GIDSIZE);
AFTER:
+ /* get the gid for this two phase operation */
+ if (command == LOGICAL_REP_MSG_PREPARE ||
+   command == LOGICAL_REP_MSG_STREAM_PREPARE)
+   strlcpy(msg.m_gid, prepare_data->gid, sizeof(msg.m_gid));
+ else if (command == LOGICAL_REP_MSG_COMMIT_PREPARED)
+   strlcpy(msg.m_gid, commit_data->gid, sizeof(msg.m_gid));
+ else /* rollback prepared */
+   strlcpy(msg.m_gid, rollback_data->gid, sizeof(msg.m_gid));


BEFORE:
+ strlcpy(prepared_txn->gid, msg->m_gid, strlen(msg->m_gid) + 1);
AFTER:
+ strlcpy(prepared_txn->gid, msg->m_gid, sizeof(prepared_txn->gid));

BEFORE:
+ memcpy(key.gid, msg->m_gid, strlen(msg->m_gid));
AFTER:
+ strlcpy(key.gid, msg->m_gid, sizeof(key.gid));

BEFORE:
+ memcpy(key.gid, gid, strlen(gid));
AFTER:
+ strlcpy(key.gid, gid, sizeof(key.gid));

(10) src/backend/replication/logical/worker.c
Some suggested rewording:

BEFORE:
+ * size of streaming transaction resources because it have used the
AFTER:
+ * size of streaming transaction resources because it has used the


BEFORE:
+ * tradeoff should not be good. Also, add multiple values
+ * at once in order to reduce the number of this function call.
AFTER:
+ * tradeoff would not be good. Also, add multiple values
+ * at once in order to reduce the number of calls to this function.

(11) update_apply_change_size()
Shouldn't this function be declared static?

(12) stream_write_change()

+ streamed_entry->xact_size = streamed_entry->xact_size + total_len;
/* update */

could be simply written as:

+ streamed_entry->xact_size += total_len; /* update */


Regards,
Greg Nancarrow
Fujitsu Australia



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Thursday, November 4, 2021 9:54 AM  Greg Nancarrow <gregn4422@gmail.com> wrote:
> On Tue, Nov 2, 2021 at 12:18 AM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Thursday, October 28, 2021 11:19 PM I wrote:
> > > I've created a new patch that extends pg_stat_subscription_workers
> > > to include other transaction statistics.
> > >
> > > Note that this patch depends on v18 patch-set in [1]...
> > Rebased based on the v19 in [1].
> > Also, fixed documentation a little and made tests tidy.
> > FYI, the newly included TAP test(027_worker_xact_stats.pl) is stable
> > because I checked that 100 times of its execution in a tight loop all passed.
> >
> 
> I have done some basic testing of the patch and have some initial review
> comments:
Thanks for your review !

> (1) I think this patch needs to be in "git format-patch" format, with a proper
> commit message that describes the purpose of the patch and the functionality
> it adds, and any high-level design points (something like the overview given in
> the initial post, and accounting for the subsequent discussion points and
> updated functionality).
Fixed.

> (2) doc/src/sgml/monitoring.sgml
> There are some grammatical issues in the current description. I suggest
> changing it to something like:
> BEFORE:
> +      <entry>At least one row per subscription, showing about
> transaction statistics and error summary that
> AFTER:
> +      <entry>At least one row per subscription, showing transaction
> statistics and information about errors that
Fixed.

> (2) doc/src/sgml/monitoring.sgml
> The current description seems a little confusing.
> Per subscription, it shows the transaction statistics and any last error info from
> tablesync/apply workers? If this is the case, I'd suggest the following change:
> 
> BEFORE:
> +   one row per subscription for transaction statistics and summary of the
> last
> +   error reported by workers applying logical replication changes and
> workers
> +   handling the initial data copy of the subscribed tables.
> AFTER:
> +   one row per subscription, showing corresponding transaction statistics
> and
> +   information about the last error reported by workers applying
> logical replication
> +   changes or by workers handling the initial data copy of the
> subscribed tables.
Fixed.

> (3) xact_commit
> I think that the "xact_commit" column should be named "xact_commit_count"
> or "xact_commits".
> Similarly, I think "xact_error" should be named "xact_error_count" or
> "xact_errors", and "xact_aborts" should be named "xact_abort_count" or
> "xact_aborts".
I prefered *_count. Renamed.

> (4) xact_commit_bytes
> 
> +       Amount of transactions data successfully applied in this subscription.
> +       Consumed memory for xact_commit is displayed.
> 
> I find this description a bit confusing. "Consumed memory for xact_commit"
> seems different to "transactions data".
> Could the description be something like: Amount of data (in bytes) successfully
> applied in this subscription, across "xact_commit_count"
> transactions.
Fixed.

> (5)
> I'd suggest some minor rewording for the following:
> 
> BEFORE:
> +       Number of transactions failed to be applied and caught by table
> +       sync worker or main apply worker in this subscription.
> AFTER:
> +       Number of transactions that failed to be applied by the table
> +       sync worker or main apply worker in this subscription.
Fixed.

> (6) xact_error_bytes
> Again, it's a little confusing referring to "consumed memory" here.
> How about rewording this, something like:
> 
> BEFORE:
> +       Amount of transactions data unsuccessfully applied in this
> subscription.
> +       Consumed memory that past failed transaction used is displayed.
> AFTER:
> +       Amount of data (in bytes) unsuccessfully applied in this
> subscription by the last failed transaction.
xact_error_bytes (and other bytes columns as well) is cumulative
so when a new error happens, the size of this new bytes would be
added to the same. So here we shouldn't mention just the last error.
I simply applied your previous comments of 'xact_commit_bytes'
to 'xact_error_bytes' description.

> (7)
> The additional information provided for "xact_abort_bytes" needs some
> rewording, something like:
> 
> BEFORE:
> +       Increase <literal>logical_decoding_work_mem</literal> on the
> publisher
> +       so that it exceeds the size of whole streamed transaction
> +       to suppress unnecessary consumed network bandwidth in addition to
> change
> +       in memory of the subscriber, if unexpected amount of streamed
> transactions
> +       are aborted.
> AFTER:
> +       In order to suppress unnecessary consumed network bandwidth,
> increase
> +       <literal>logical_decoding_work_mem</literal> on the publisher so
> that it
> +       exceeds the size of the whole streamed transaction, and
> additionally increase
> +       the available subscriber memory, if an unexpected amount of
> streamed transactions
> +       are aborted.

I'm not sure about the last part.
> additionally increase the available subscriber memory,
Which GUC parameter did you mean by this ?
Could we point out and enalrge the memory size only for
subscriber's apply processing intentionally ?
I incorporated (7) except for this last part.
Will revise according to your reply.

I also added the explanation about
xact_abort_bytes itself to align with other bytes columns.

> (8)
> Suggested update:
> 
> BEFORE:
> + *  Tell the collector that worker transaction has finished without problem.
> AFTER:
> + *  Tell the collector that the worker transaction has successfully completed.
Fixed.

> (9) src/backend/postmaster/pgstat.c
> I think that the GID copying is unnecessarily copying the whole GID buffer or
> using an additional strlen().
> It should be changed to use strlcpy() to match other code:
> 
> BEFORE:
> + /* get the gid for this two phase operation */ if (command ==
> + LOGICAL_REP_MSG_PREPARE ||
> +   command == LOGICAL_REP_MSG_STREAM_PREPARE)
> +   memcpy(msg.m_gid, prepare_data->gid, GIDSIZE); else if (command ==
> + LOGICAL_REP_MSG_COMMIT_PREPARED)
> +   memcpy(msg.m_gid, commit_data->gid, GIDSIZE); else /* rollback
> + prepared */
> +   memcpy(msg.m_gid, rollback_data->gid, GIDSIZE);
> AFTER:
> + /* get the gid for this two phase operation */ if (command ==
> + LOGICAL_REP_MSG_PREPARE ||
> +   command == LOGICAL_REP_MSG_STREAM_PREPARE)
> +   strlcpy(msg.m_gid, prepare_data->gid, sizeof(msg.m_gid)); else if
> + (command == LOGICAL_REP_MSG_COMMIT_PREPARED)
> +   strlcpy(msg.m_gid, commit_data->gid, sizeof(msg.m_gid)); else /*
> + rollback prepared */
> +   strlcpy(msg.m_gid, rollback_data->gid, sizeof(msg.m_gid));
Fixed.

> 
> BEFORE:
> + strlcpy(prepared_txn->gid, msg->m_gid, strlen(msg->m_gid) + 1);
> AFTER:
> + strlcpy(prepared_txn->gid, msg->m_gid, sizeof(prepared_txn->gid));
> 
> BEFORE:
> + memcpy(key.gid, msg->m_gid, strlen(msg->m_gid));
> AFTER:
> + strlcpy(key.gid, msg->m_gid, sizeof(key.gid));
> 
> BEFORE:
> + memcpy(key.gid, gid, strlen(gid));
> AFTER:
> + strlcpy(key.gid, gid, sizeof(key.gid));
Fixed.

> (10) src/backend/replication/logical/worker.c
> Some suggested rewording:
> 
> BEFORE:
> + * size of streaming transaction resources because it have used the
> AFTER:
> + * size of streaming transaction resources because it has used the
Fixed.
 
> BEFORE:
> + * tradeoff should not be good. Also, add multiple values
> + * at once in order to reduce the number of this function call.
> AFTER:
> + * tradeoff would not be good. Also, add multiple values
> + * at once in order to reduce the number of calls to this function.
Fixed.

> (11) update_apply_change_size()
> Shouldn't this function be declared static?
Fixed.

> (12) stream_write_change()
> 
> + streamed_entry->xact_size = streamed_entry->xact_size + total_len;
> /* update */
> 
> could be simply written as:
> 
> + streamed_entry->xact_size += total_len; /* update */
Fixed.

Lastly, I removed one unnecessary test that
checked publisher's stats in the TAP tests.
Also I introduced ApplyTxnExtraData structure to
remove void* argument of update_apply_change_size
that might worsen the readability of codes
in the previous version.


Best Regards,
    Takamichi Osumi


Вложения

Re: Failed transaction statistics to measure the logical replication progress

От
vignesh C
Дата:
On Fri, Nov 5, 2021 at 1:42 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Thursday, November 4, 2021 9:54 AM  Greg Nancarrow <gregn4422@gmail.com> wrote:
> > On Tue, Nov 2, 2021 at 12:18 AM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > On Thursday, October 28, 2021 11:19 PM I wrote:
> > > > I've created a new patch that extends pg_stat_subscription_workers
> > > > to include other transaction statistics.
> > > >
> > > > Note that this patch depends on v18 patch-set in [1]...
> > > Rebased based on the v19 in [1].
> > > Also, fixed documentation a little and made tests tidy.
> > > FYI, the newly included TAP test(027_worker_xact_stats.pl) is stable
> > > because I checked that 100 times of its execution in a tight loop all passed.
> > >
> >
> > I have done some basic testing of the patch and have some initial review
> > comments:
> Thanks for your review !
>
> > (1) I think this patch needs to be in "git format-patch" format, with a proper
> > commit message that describes the purpose of the patch and the functionality
> > it adds, and any high-level design points (something like the overview given in
> > the initial post, and accounting for the subsequent discussion points and
> > updated functionality).
> Fixed.
>
> > (2) doc/src/sgml/monitoring.sgml
> > There are some grammatical issues in the current description. I suggest
> > changing it to something like:
> > BEFORE:
> > +      <entry>At least one row per subscription, showing about
> > transaction statistics and error summary that
> > AFTER:
> > +      <entry>At least one row per subscription, showing transaction
> > statistics and information about errors that
> Fixed.
>
> > (2) doc/src/sgml/monitoring.sgml
> > The current description seems a little confusing.
> > Per subscription, it shows the transaction statistics and any last error info from
> > tablesync/apply workers? If this is the case, I'd suggest the following change:
> >
> > BEFORE:
> > +   one row per subscription for transaction statistics and summary of the
> > last
> > +   error reported by workers applying logical replication changes and
> > workers
> > +   handling the initial data copy of the subscribed tables.
> > AFTER:
> > +   one row per subscription, showing corresponding transaction statistics
> > and
> > +   information about the last error reported by workers applying
> > logical replication
> > +   changes or by workers handling the initial data copy of the
> > subscribed tables.
> Fixed.
>
> > (3) xact_commit
> > I think that the "xact_commit" column should be named "xact_commit_count"
> > or "xact_commits".
> > Similarly, I think "xact_error" should be named "xact_error_count" or
> > "xact_errors", and "xact_aborts" should be named "xact_abort_count" or
> > "xact_aborts".
> I prefered *_count. Renamed.
>
> > (4) xact_commit_bytes
> >
> > +       Amount of transactions data successfully applied in this subscription.
> > +       Consumed memory for xact_commit is displayed.
> >
> > I find this description a bit confusing. "Consumed memory for xact_commit"
> > seems different to "transactions data".
> > Could the description be something like: Amount of data (in bytes) successfully
> > applied in this subscription, across "xact_commit_count"
> > transactions.
> Fixed.
>
> > (5)
> > I'd suggest some minor rewording for the following:
> >
> > BEFORE:
> > +       Number of transactions failed to be applied and caught by table
> > +       sync worker or main apply worker in this subscription.
> > AFTER:
> > +       Number of transactions that failed to be applied by the table
> > +       sync worker or main apply worker in this subscription.
> Fixed.
>
> > (6) xact_error_bytes
> > Again, it's a little confusing referring to "consumed memory" here.
> > How about rewording this, something like:
> >
> > BEFORE:
> > +       Amount of transactions data unsuccessfully applied in this
> > subscription.
> > +       Consumed memory that past failed transaction used is displayed.
> > AFTER:
> > +       Amount of data (in bytes) unsuccessfully applied in this
> > subscription by the last failed transaction.
> xact_error_bytes (and other bytes columns as well) is cumulative
> so when a new error happens, the size of this new bytes would be
> added to the same. So here we shouldn't mention just the last error.
> I simply applied your previous comments of 'xact_commit_bytes'
> to 'xact_error_bytes' description.
>
> > (7)
> > The additional information provided for "xact_abort_bytes" needs some
> > rewording, something like:
> >
> > BEFORE:
> > +       Increase <literal>logical_decoding_work_mem</literal> on the
> > publisher
> > +       so that it exceeds the size of whole streamed transaction
> > +       to suppress unnecessary consumed network bandwidth in addition to
> > change
> > +       in memory of the subscriber, if unexpected amount of streamed
> > transactions
> > +       are aborted.
> > AFTER:
> > +       In order to suppress unnecessary consumed network bandwidth,
> > increase
> > +       <literal>logical_decoding_work_mem</literal> on the publisher so
> > that it
> > +       exceeds the size of the whole streamed transaction, and
> > additionally increase
> > +       the available subscriber memory, if an unexpected amount of
> > streamed transactions
> > +       are aborted.
>
> I'm not sure about the last part.
> > additionally increase the available subscriber memory,
> Which GUC parameter did you mean by this ?
> Could we point out and enalrge the memory size only for
> subscriber's apply processing intentionally ?
> I incorporated (7) except for this last part.
> Will revise according to your reply.
>
> I also added the explanation about
> xact_abort_bytes itself to align with other bytes columns.
>
> > (8)
> > Suggested update:
> >
> > BEFORE:
> > + *  Tell the collector that worker transaction has finished without problem.
> > AFTER:
> > + *  Tell the collector that the worker transaction has successfully completed.
> Fixed.
>
> > (9) src/backend/postmaster/pgstat.c
> > I think that the GID copying is unnecessarily copying the whole GID buffer or
> > using an additional strlen().
> > It should be changed to use strlcpy() to match other code:
> >
> > BEFORE:
> > + /* get the gid for this two phase operation */ if (command ==
> > + LOGICAL_REP_MSG_PREPARE ||
> > +   command == LOGICAL_REP_MSG_STREAM_PREPARE)
> > +   memcpy(msg.m_gid, prepare_data->gid, GIDSIZE); else if (command ==
> > + LOGICAL_REP_MSG_COMMIT_PREPARED)
> > +   memcpy(msg.m_gid, commit_data->gid, GIDSIZE); else /* rollback
> > + prepared */
> > +   memcpy(msg.m_gid, rollback_data->gid, GIDSIZE);
> > AFTER:
> > + /* get the gid for this two phase operation */ if (command ==
> > + LOGICAL_REP_MSG_PREPARE ||
> > +   command == LOGICAL_REP_MSG_STREAM_PREPARE)
> > +   strlcpy(msg.m_gid, prepare_data->gid, sizeof(msg.m_gid)); else if
> > + (command == LOGICAL_REP_MSG_COMMIT_PREPARED)
> > +   strlcpy(msg.m_gid, commit_data->gid, sizeof(msg.m_gid)); else /*
> > + rollback prepared */
> > +   strlcpy(msg.m_gid, rollback_data->gid, sizeof(msg.m_gid));
> Fixed.
>
> >
> > BEFORE:
> > + strlcpy(prepared_txn->gid, msg->m_gid, strlen(msg->m_gid) + 1);
> > AFTER:
> > + strlcpy(prepared_txn->gid, msg->m_gid, sizeof(prepared_txn->gid));
> >
> > BEFORE:
> > + memcpy(key.gid, msg->m_gid, strlen(msg->m_gid));
> > AFTER:
> > + strlcpy(key.gid, msg->m_gid, sizeof(key.gid));
> >
> > BEFORE:
> > + memcpy(key.gid, gid, strlen(gid));
> > AFTER:
> > + strlcpy(key.gid, gid, sizeof(key.gid));
> Fixed.
>
> > (10) src/backend/replication/logical/worker.c
> > Some suggested rewording:
> >
> > BEFORE:
> > + * size of streaming transaction resources because it have used the
> > AFTER:
> > + * size of streaming transaction resources because it has used the
> Fixed.
>
> > BEFORE:
> > + * tradeoff should not be good. Also, add multiple values
> > + * at once in order to reduce the number of this function call.
> > AFTER:
> > + * tradeoff would not be good. Also, add multiple values
> > + * at once in order to reduce the number of calls to this function.
> Fixed.
>
> > (11) update_apply_change_size()
> > Shouldn't this function be declared static?
> Fixed.
>
> > (12) stream_write_change()
> >
> > + streamed_entry->xact_size = streamed_entry->xact_size + total_len;
> > /* update */
> >
> > could be simply written as:
> >
> > + streamed_entry->xact_size += total_len; /* update */
> Fixed.
>
> Lastly, I removed one unnecessary test that
> checked publisher's stats in the TAP tests.
> Also I introduced ApplyTxnExtraData structure to
> remove void* argument of update_apply_change_size
> that might worsen the readability of codes
> in the previous version.

Thanks for the updated patch.
Few comments:
1)  You could remove LogicalRepPreparedTxnData,
LogicalRepCommitPreparedTxnData & LogicalRepRollbackPreparedTxnData
and change it to char *gid to reduce the function parameter and
simiplify the assignment:
+ */
+void
+pgstat_report_subworker_twophase_xact(Oid subid, LogicalRepMsgType command,
+
   PgStat_Counter xact_size,
+
   LogicalRepPreparedTxnData *prepare_data,
+
   LogicalRepCommitPreparedTxnData *commit_data,
+
   LogicalRepRollbackPreparedTxnData *rollback_data)


2) Shouldn't this change be part of skip xid patch?
-       TupleDescInitEntry(tupdesc, (AttrNumber) 4, "command",
+       TupleDescInitEntry(tupdesc, (AttrNumber) 10, "last_error_command",
                                           TEXTOID, -1, 0);
-       TupleDescInitEntry(tupdesc, (AttrNumber) 5, "xid",
+       TupleDescInitEntry(tupdesc, (AttrNumber) 11, "last_error_xid",
                                           XIDOID, -1, 0);
-       TupleDescInitEntry(tupdesc, (AttrNumber) 6, "error_count",
+       TupleDescInitEntry(tupdesc, (AttrNumber) 12, "last_error_count",
                                           INT8OID, -1, 0);
-       TupleDescInitEntry(tupdesc, (AttrNumber) 7, "error_message",
+       TupleDescInitEntry(tupdesc, (AttrNumber) 13, "last_error_message",
                                           TEXTOID, -1, 0);
-       TupleDescInitEntry(tupdesc, (AttrNumber) 8, "last_error_time",
+       TupleDescInitEntry(tupdesc, (AttrNumber) 14, "last_error_time",

3) This newly added structures should be added to typedefs.list:
ApplyTxnExtraData
XactSizeEntry
PgStat_MsgSubWorkerXactEnd
PgStat_MsgSubWorkerTwophaseXact
PgStat_StatSubWorkerPreparedXact
PgStat_StatSubWorkerPreparedXactSize

4) We are not sending the transaction size in case of table sync, is
this intentional, if so I felt we should document this in
pg_stat_subscription_workers
+       /* Report the success of table sync. */
+       pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
+
  MyLogicalRepWorker->relid,
+
  0 /* no logical message type */,
+
  0 /* xact size */);
+

5) pg_stat_subscription_workers has a lot of columns, if we can reduce
the column size the readability will improve, like xact_commit_count
to commit_count, xact_commit_bytes to commit_bytes, etc
+    w.xact_commit_count,
+    w.xact_commit_bytes,
+    w.xact_error_count,
+    w.xact_error_bytes,
+    w.xact_abort_count,
+    w.xact_abort_bytes,

Regards,
Vignesh



Re: Failed transaction statistics to measure the logical replication progress

От
Greg Nancarrow
Дата:
On Fri, Nov 5, 2021 at 7:11 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> I'm not sure about the last part.
> > additionally increase the available subscriber memory,
> Which GUC parameter did you mean by this ?
> Could we point out and enalrge the memory size only for
> subscriber's apply processing intentionally ?
> I incorporated (7) except for this last part.
> Will revise according to your reply.
>
I might have misinterpreted your original description, so I'll
re-review that in your latest patch.

As a newer version (v20) of the prerequisite patch was posted a day
ago, it looks like your patch needs to be rebased against that (as it
currently applies on top of the v19 version only).


Regards,
Greg Nancarrow
Fujitsu Australia



Re: Failed transaction statistics to measure the logical replication progress

От
Greg Nancarrow
Дата:
On Fri, Nov 5, 2021 at 7:11 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>

I did a quick scan through the latest v8 patch and noticed the following things:

src/backend/postmaster/pgstat.c

(1) pgstat_recv_subworker_twophase_xact()
The copying from msg->m_gid to key.gid does not seem to be correct.
strlen() is being called on a junk value, since key.gid has not been
assigned yet.
It should be changed as follows:

BEFORE:
+ strlcpy(key.gid, msg->m_gid, strlen(key.gid));
AFTER:
+ strlcpy(key.gid, msg->m_gid, sizeof(key.gid));


(2) pgstat_get_subworker_prepared_txn()
Similar to above, strlen() usage is not correct, and should use
strlcpy() instead of memcpy().

BEFORE:
+ memcpy(key.gid, gid, strlen(key.gid));
AFTER:
+ strlcpy(key.gid, gid, sizeof(key.gid));

(3) stats_reset
Note that the "stats_reset" column has been removed from the
pg_stat_subscription_workers view in the underlying latest v20 patch.


Regards,
Greg Nancarrow
Fujitsu Australia



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Tuesday, November 9, 2021 12:08 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> On Fri, Nov 5, 2021 at 7:11 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> 
> I did a quick scan through the latest v8 patch and noticed the following things:
I appreciate your review !

> src/backend/postmaster/pgstat.c
> 
> (1) pgstat_recv_subworker_twophase_xact()
> The copying from msg->m_gid to key.gid does not seem to be correct.
> strlen() is being called on a junk value, since key.gid has not been assigned yet.
> It should be changed as follows:
> 
> BEFORE:
> + strlcpy(key.gid, msg->m_gid, strlen(key.gid));
> AFTER:
> + strlcpy(key.gid, msg->m_gid, sizeof(key.gid));
Fixed.
 
> (2) pgstat_get_subworker_prepared_txn()
> Similar to above, strlen() usage is not correct, and should use
> strlcpy() instead of memcpy().
> 
> BEFORE:
> + memcpy(key.gid, gid, strlen(key.gid));
> AFTER:
> + strlcpy(key.gid, gid, sizeof(key.gid));
Fixed.

 
> (3) stats_reset
> Note that the "stats_reset" column has been removed from the
> pg_stat_subscription_workers view in the underlying latest v20 patch.
Yes. I've rebased and updated the patch, paying attention to this point.
Attached the updated version.


Best Regards,
    Takamichi Osumi


Вложения

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Monday, November 8, 2021 3:12 PM vignesh C <vignesh21@gmail.com> wrote:
> On Fri, Nov 5, 2021 at 1:42 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> > Lastly, I removed one unnecessary test that checked publisher's stats
> > in the TAP tests.
> > Also I introduced ApplyTxnExtraData structure to remove void* argument
> > of update_apply_change_size that might worsen the readability of codes
> > in the previous version.
> 
> Thanks for the updated patch.
Thanks you for checking my patch !


> Few comments:
> 1)  You could remove LogicalRepPreparedTxnData,
> LogicalRepCommitPreparedTxnData & LogicalRepRollbackPreparedTxnData
> and change it to char *gid to reduce the function parameter and simiplify the
> assignment:
> + */
> +void
> +pgstat_report_subworker_twophase_xact(Oid subid, LogicalRepMsgType
> +command,
> +
>    PgStat_Counter xact_size,
> +
>    LogicalRepPreparedTxnData *prepare_data,
> +
>    LogicalRepCommitPreparedTxnData *commit_data,
> +
>    LogicalRepRollbackPreparedTxnData *rollback_data)
Fixed. 

 
> 2) Shouldn't this change be part of skip xid patch?
> -       TupleDescInitEntry(tupdesc, (AttrNumber) 4, "command",
> +       TupleDescInitEntry(tupdesc, (AttrNumber) 10,
> + "last_error_command",
>                                            TEXTOID, -1, 0);
> -       TupleDescInitEntry(tupdesc, (AttrNumber) 5, "xid",
> +       TupleDescInitEntry(tupdesc, (AttrNumber) 11, "last_error_xid",
>                                            XIDOID, -1, 0);
> -       TupleDescInitEntry(tupdesc, (AttrNumber) 6, "error_count",
> +       TupleDescInitEntry(tupdesc, (AttrNumber) 12, "last_error_count",
>                                            INT8OID, -1, 0);
> -       TupleDescInitEntry(tupdesc, (AttrNumber) 7, "error_message",
> +       TupleDescInitEntry(tupdesc, (AttrNumber) 13,
> + "last_error_message",
>                                            TEXTOID, -1, 0);
> -       TupleDescInitEntry(tupdesc, (AttrNumber) 8, "last_error_time",
> +       TupleDescInitEntry(tupdesc, (AttrNumber) 14, "last_error_time",
Hmm, I didn't think so. Those renames are necessary
to make exisiting columns of skip xid separate from newly-introduced xact stats.
That means, original names of skip xid columns in v20 by itself are fine
and the renames are needed only when this patch gets committed.
At present, we cannot guarantee that this patch will be committed
so I'd like to take care of those renames.

 
> 3) This newly added structures should be added to typedefs.list:
> ApplyTxnExtraData
> XactSizeEntry
> PgStat_MsgSubWorkerXactEnd
> PgStat_MsgSubWorkerTwophaseXact
> PgStat_StatSubWorkerPreparedXact
> PgStat_StatSubWorkerPreparedXactSize
Added.

> 4) We are not sending the transaction size in case of table sync, is this
> intentional, if so I felt we should document this in
> pg_stat_subscription_workers
> +       /* Report the success of table sync. */
> +       pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
> +
>   MyLogicalRepWorker->relid,
> +
>   0 /* no logical message type */,
> +
>   0 /* xact size */);
Right. Updated the doc description.
I added the description in a way that the bytes
stats are only for apply worker.

 
> 5) pg_stat_subscription_workers has a lot of columns, if we can reduce the
> column size the readability will improve, like xact_commit_count to
> commit_count, xact_commit_bytes to commit_bytes, etc
> +    w.xact_commit_count,
> +    w.xact_commit_bytes,
> +    w.xact_error_count,
> +    w.xact_error_bytes,
> +    w.xact_abort_count,
> +    w.xact_abort_bytes,
It makes sense. Those can be somehow redundant.

Tentatively, I renamed only columns' names exported to the users.
This is because changing internal data structure as well (e.g. removing
the PgStat_StatSubWorkerEntry's prefixes) causes duplication name
of 'error_count' members and changing such an internal data structure
of skip xid part will have a huge impact of other parts.
Kindly imagine a case that we add 'last_' prefix to the
all error statistics representing an error of the structure.
If you aren't satisfied with this change, please let me know.

Best Regards,
    Takamichi Osumi


RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Tuesday, November 9, 2021 8:35 PM I wrote:
> Yes. I've rebased and updated the patch, paying attention to this point.
> Attached the updated version.
Forgot to note one thing.
This is based on the skip xid v20 shared in [1]

[1] - https://www.postgresql.org/message-id/CAD21AoAT42mhcqeB1jPfRL1%2BEUHbZk8MMY_fBgsyZvJeKNpG%2Bw%40mail.gmail.com

Best Regards,
    Takamichi Osumi


Re: Failed transaction statistics to measure the logical replication progress

От
Dilip Kumar
Дата:
On Tue, Nov 9, 2021 at 5:05 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Tuesday, November 9, 2021 12:08 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> > On Fri, Nov 5, 2021 at 7:11 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> >
> > I did a quick scan through the latest v8 patch and noticed the following things:
> I appreciate your review !
>

I have reviewed some part of the patch and I have a few comments

1.
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>error_count</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of transactions that failed to be applied by the table
+       sync worker or main apply worker in this subscription.
+       </para></entry>
+     </row>

The error_count, should be number of transaction failed to applied? or
it should be number of error?

2.
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>error_bytes</structfield> <type>bigint</type>
+      </para>

How different is error_bytes from the abort_bytes?

3.
+    {
+        size += *extra_data->stream_write_len;
+        add_apply_error_context_xact_size(size);
+        return;
+    }

From apply_handle_insert(), we are calling update_apply_change_size(),
and inside this function we are dereferencing
*extra_data->stream_write_len.  Basically, stream_write_len is in
integer pointer and the caller hasn't allocated memory for that and
inside update_apply_change_size, we are directly dereferencing the
pointer, how this can be correct.  And I also see that in the whole
patch stream_write_len, is never used as lvalue so without storing
anything into this why are we trying to use this as rvalue here?  This
is clearly an issue.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Wednesday, November 10, 2021 3:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Tue, Nov 9, 2021 at 5:05 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> > On Tuesday, November 9, 2021 12:08 PM Greg Nancarrow
> <gregn4422@gmail.com> wrote:
> > > On Fri, Nov 5, 2021 at 7:11 PM osumi.takamichi@fujitsu.com
> > > <osumi.takamichi@fujitsu.com> wrote:
> > > >
> > >
> > > I did a quick scan through the latest v8 patch and noticed the following
> things:
> > I appreciate your review !
> I have reviewed some part of the patch and I have a few comments
I really appreciate your attention and review.

> 1.
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>error_count</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Number of transactions that failed to be applied by the table
> +       sync worker or main apply worker in this subscription.
> +       </para></entry>
> +     </row>
> 
> The error_count, should be number of transaction failed to applied? or it should
> be number of error?
I thought those were same and currently it gets incremented when an error of apply occurs.
Then, it equals to the number of total error. May I have the case when we
get different values between those two ? I can be missing something.

> 2.
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>error_bytes</structfield> <type>bigint</type>
> +      </para>
> 
> How different is error_bytes from the abort_bytes?
By the error_bytes, you can see the consumed resources that
were acquired during apply but the applying processing stopped by some error.
On the other hand, abort_bytes displays bytes used for ROLLBACK PREPARED
and stream_abort processing. That's what I intended.


> 3.
> +    {
> +        size += *extra_data->stream_write_len;
> +        add_apply_error_context_xact_size(size);
> +        return;
> +    }
> 
> From apply_handle_insert(), we are calling update_apply_change_size(), and
> inside this function we are dereferencing *extra_data->stream_write_len.
> Basically, stream_write_len is in integer pointer and the caller hasn't allocated
> memory for that and inside update_apply_change_size, we are directly
> dereferencing the pointer, how this can be correct. 
I'm so sorry to make you confused.

I'll just delete the top part that handles streaming
bytes calculation in the update_apply_change_size().
It's because now that there is a specific structure to recognize each streaming xid
and save transaction size there, which makes the top part in question useless.

> And I also see that in the
> whole patch stream_write_len, is never used as lvalue so without storing
> anything into this why are we trying to use this as rvalue here?  This is clearly
> an issue.
As described above, I'll fix this part and
related codes mainly streaming related codes in the next version.


Best Regards,
    Takamichi Osumi


Re: Failed transaction statistics to measure the logical replication progress

От
vignesh C
Дата:
On Tue, Nov 9, 2021 at 5:05 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
> Yes. I've rebased and updated the patch, paying attention to this point.
> Attached the updated version.

Thanks for the updated patch, few comments:
1) you could rename PgStat_StatSubWorkerPreparedXact to
PgStat_SW_PreparedXactKey or a simpler name which includes key and
similarly change PgStat_StatSubWorkerPreparedXactSize to
PgStat_SW_PreparedXactEntry

+/* prepared transaction */
+typedef struct PgStat_StatSubWorkerPreparedXact
+{
+       Oid                     subid;
+       char            gid[GIDSIZE];
+} PgStat_StatSubWorkerPreparedXact;
+
+typedef struct PgStat_StatSubWorkerPreparedXactSize
+{
+       PgStat_StatSubWorkerPreparedXact key; /* hash key */
+
+       Oid                     subid;
+       char            gid[GIDSIZE];
+       PgStat_Counter xact_size;
+} PgStat_StatSubWorkerPreparedXactSize;
+

2) You can change prepared_size to sw_prepared_xact_entry or
prepared_xact_entry since it is a hash entry with few fields
+       if (subWorkerPreparedXactSizeHash)
+       {
+               PgStat_StatSubWorkerPreparedXactSize *prepared_size;
+
+               hash_seq_init(&hstat, subWorkerPreparedXactSizeHash);
+               while((prepared_size =
(PgStat_StatSubWorkerPreparedXactSize *) hash_seq_search(&hstat)) !=
NULL)
+               {
+                       fputc('P', fpout);
+                       rc = fwrite(prepared_size,
sizeof(PgStat_StatSubWorkerPreparedXactSize), 1, fpout);
+                       (void) rc;                      /* we'll check
for error with ferror */
+               }

3) This need to be indented
-        w.relid,
-        w.command,
-        w.xid,
-        w.error_count,
-        w.error_message,
-        w.last_error_time
+       w.commit_count,
+       w.commit_bytes,
+       w.error_count,
+       w.error_bytes,
+       w.abort_count,
+       w.abort_bytes,
+       w.last_error_relid,
+       w.last_error_command,
+       w.last_error_xid,
+       w.last_error_count,
+       w.last_error_message,
+       w.last_error_time

4) Instead of adding a function to calculate the size, can we move
PartitionTupleRouting from c file to the header file and use sizeof at
the caller function?
+/*
+ * PartitionTupleRoutingSize - exported to calculate total data size
+ * of logical replication mesage apply, because this is one of the
+ * ApplyExecutionData struct members.
+ */
+size_t
+PartitionTupleRoutingSize(void)
+{
+       return sizeof(PartitionTupleRouting);
+}

5) You could run pgindent and pgperltidy for the code and test code to
fix the indent issues.
+
subWorkerPreparedXactSizeHash = hash_create("Subscription worker stats
of prepared txn",
+
                                         PGSTAT_SUBWORKER_HASH_SIZE,
+
                                         &hash_ctl,
+
                                         HASH_ELEM | HASH_STRINGS |
HASH_CONTEXT);

+# There's no entry at the beginning
+my $result = $node_subscriber->safe_psql('postgres',
+"SELECT count(*) FROM pg_stat_subscription_workers;");
+is($result, q(0), 'no entry for transaction stats yet');

6) Few places you have used strlcpy and few places you have used
memcpy, you can keep it consistent:
+       msg.m_command = command;
+       strlcpy(msg.m_gid, gid, sizeof(msg.m_gid));
+       msg.m_xact_bytes = xact_size;

+       key.subid = subid;
+       memcpy(key.gid, gid, sizeof(key.gid));
+       action = (create ? HASH_ENTER : HASH_FIND);

Regards,
Vignesh



Re: Failed transaction statistics to measure the logical replication progress

От
vignesh C
Дата:
On Wed, Nov 10, 2021 at 3:43 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, Nov 9, 2021 at 5:05 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> > Yes. I've rebased and updated the patch, paying attention to this point.
> > Attached the updated version.
>
> Thanks for the updated patch, few comments:
> 6) Few places you have used strlcpy and few places you have used
> memcpy, you can keep it consistent:
> +       msg.m_command = command;
> +       strlcpy(msg.m_gid, gid, sizeof(msg.m_gid));
> +       msg.m_xact_bytes = xact_size;
>
> +       key.subid = subid;
> +       memcpy(key.gid, gid, sizeof(key.gid));
> +       action = (create ? HASH_ENTER : HASH_FIND);

Few more comments:
1) Here the tuple length is not considered in the calculation, else it
will always show the fixed size for any size tuple. Ex varchar insert
with 1 byte or varchar insert with 100's of bytes. So I feel we should
include the tuple length in the calculation.
+               case LOGICAL_REP_MSG_INSERT:
+               case LOGICAL_REP_MSG_UPDATE:
+               case LOGICAL_REP_MSG_DELETE:
+                       Assert(extra_data != NULL);
+
+                       /*
+                        * Compute size based on ApplyExecutionData.
+                        * The size of LogicalRepRelMapEntry can be
skipped because
+                        * it is obtained from hash_search in
logicalrep_rel_open.
+                        */
+                       size += sizeof(ApplyExecutionData) + sizeof(EState) +
+                               sizeof(ResultRelInfo) + sizeof(ResultRelInfo);
+
+                       /*
+                        * Add some extra size if the target relation
is partitioned.
+                        * PartitionTupleRouting isn't exported.
Therefore, call the
+                        * function that returns its size instead.
+                        */
+                       if
(extra_data->relmapentry->localrel->rd_rel->relkind ==
RELKIND_PARTITIONED_TABLE)
+                               size += sizeof(ModifyTableState) +
PartitionTupleRoutingSize();
+                       break;

2) Can this be part of PgStat_StatDBEntry, similar to tables,
functions and subworkers. It might be more appropriate to have it
there instead of having another global variable.
+ * Stats of prepared transactions should be displayed
+ * at either commit prepared or rollback prepared time, even when it's
+ * after the server restart. We have the apply worker send those statistics
+ * to the stats collector at prepare time and the startup process restore
+ * those at restart if necessary.
+ */
+static HTAB *subWorkerPreparedXactSizeHash = NULL;
+
+/*

Regards,
Vignesh



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Thursday, November 11, 2021 9:47 PM vignesh C <vignesh21@gmail.com> wrote:
> Few more comments:
> 1) Here the tuple length is not considered in the calculation, else it will always
> show the fixed size for any size tuple. Ex varchar insert with 1 byte or varchar
> insert with 100's of bytes. So I feel we should include the tuple length in the
> calculation.
> +               case LOGICAL_REP_MSG_INSERT:
> +               case LOGICAL_REP_MSG_UPDATE:
> +               case LOGICAL_REP_MSG_DELETE:
> +                       Assert(extra_data != NULL);
> +
> +                       /*
> +                        * Compute size based on ApplyExecutionData.
> +                        * The size of LogicalRepRelMapEntry can be
> skipped because
> +                        * it is obtained from hash_search in
> logicalrep_rel_open.
> +                        */
> +                       size += sizeof(ApplyExecutionData) + sizeof(EState)
> +
> +                               sizeof(ResultRelInfo) +
> + sizeof(ResultRelInfo);
> +
> +                       /*
> +                        * Add some extra size if the target relation
> is partitioned.
> +                        * PartitionTupleRouting isn't exported.
> Therefore, call the
> +                        * function that returns its size instead.
> +                        */
> +                       if
> (extra_data->relmapentry->localrel->rd_rel->relkind ==
> RELKIND_PARTITIONED_TABLE)
> +                               size += sizeof(ModifyTableState) +
> PartitionTupleRoutingSize();
> +                       break;
Thanks a lot ! Fixed.


> 2) Can this be part of PgStat_StatDBEntry, similar to tables, functions and
> subworkers. It might be more appropriate to have it there instead of having
> another global variable.
> + * Stats of prepared transactions should be displayed
> + * at either commit prepared or rollback prepared time, even when it's
> + * after the server restart. We have the apply worker send those
> +statistics
> + * to the stats collector at prepare time and the startup process
> +restore
> + * those at restart if necessary.
> + */
> +static HTAB *subWorkerPreparedXactSizeHash = NULL;
> +
> +/*
Fixed. Also, its name was too long
when aligned with other PgStat_StatDBEntry memebers.
Thus I renamed it as subworkers_preparedsizes.

This depends on v21 in [1]

[1] - https://www.postgresql.org/message-id/CAD21AoAkd4YSoQUUFfpcrYOtkPRbninaw3sD0qc77nLW6Q89gg%40mail.gmail.com

Best Regards,
    Takamichi Osumi


Вложения

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Wednesday, November 10, 2021 7:13 PM vignesh C <vignesh21@gmail.com> wrote:
> On Tue, Nov 9, 2021 at 5:05 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> > Yes. I've rebased and updated the patch, paying attention to this point.
> > Attached the updated version.
> 
> Thanks for the updated patch, few comments:
> 1) you could rename PgStat_StatSubWorkerPreparedXact to
> PgStat_SW_PreparedXactKey or a simpler name which includes key and
> similarly change PgStat_StatSubWorkerPreparedXactSize to
> PgStat_SW_PreparedXactEntry
> 
> +/* prepared transaction */
> +typedef struct PgStat_StatSubWorkerPreparedXact {
> +       Oid                     subid;
> +       char            gid[GIDSIZE];
> +} PgStat_StatSubWorkerPreparedXact;
> +
> +typedef struct PgStat_StatSubWorkerPreparedXactSize
> +{
> +       PgStat_StatSubWorkerPreparedXact key; /* hash key */
> +
> +       Oid                     subid;
> +       char            gid[GIDSIZE];
> +       PgStat_Counter xact_size;
> +} PgStat_StatSubWorkerPreparedXactSize;
> +
Fixed. Adopted your suggested names.


> 2) You can change prepared_size to sw_prepared_xact_entry or
> prepared_xact_entry since it is a hash entry with few fields
> +       if (subWorkerPreparedXactSizeHash)
> +       {
> +               PgStat_StatSubWorkerPreparedXactSize *prepared_size;
> +
> +               hash_seq_init(&hstat, subWorkerPreparedXactSizeHash);
> +               while((prepared_size =
> (PgStat_StatSubWorkerPreparedXactSize *) hash_seq_search(&hstat)) !=
> NULL)
> +               {
> +                       fputc('P', fpout);
> +                       rc = fwrite(prepared_size,
> sizeof(PgStat_StatSubWorkerPreparedXactSize), 1, fpout);
> +                       (void) rc;                      /* we'll check
> for error with ferror */
> +               }
I preferred prepared_xact_entry. Fixed.


> 3) This need to be indented
> -        w.relid,
> -        w.command,
> -        w.xid,
> -        w.error_count,
> -        w.error_message,
> -        w.last_error_time
> +       w.commit_count,
> +       w.commit_bytes,
> +       w.error_count,
> +       w.error_bytes,
> +       w.abort_count,
> +       w.abort_bytes,
> +       w.last_error_relid,
> +       w.last_error_command,
> +       w.last_error_xid,
> +       w.last_error_count,
> +       w.last_error_message,
> +       w.last_error_time
Fixed.


> 4) Instead of adding a function to calculate the size, can we move
> PartitionTupleRouting from c file to the header file and use sizeof at the caller
> function?
> +/*
> + * PartitionTupleRoutingSize - exported to calculate total data size
> + * of logical replication mesage apply, because this is one of the
> + * ApplyExecutionData struct members.
> + */
> +size_t
> +PartitionTupleRoutingSize(void)
> +{
> +       return sizeof(PartitionTupleRouting); }
Fixed.


> 5) You could run pgindent and pgperltidy for the code and test code to fix the
> indent issues.
> +
> subWorkerPreparedXactSizeHash = hash_create("Subscription worker stats
> of prepared txn",
> +
> 
> PGSTAT_SUBWORKER_HASH_SIZE,
> +
>                                          &hash_ctl,
> +
>                                          HASH_ELEM | HASH_STRINGS |
> HASH_CONTEXT);
> +# There's no entry at the beginning
> +my $result = $node_subscriber->safe_psql('postgres',
> +"SELECT count(*) FROM pg_stat_subscription_workers;"); is($result,
> +q(0), 'no entry for transaction stats yet');
Conducted pgindent and pgperltidy.


> 6) Few places you have used strlcpy and few places you have used memcpy,
> you can keep it consistent:
> +       msg.m_command = command;
> +       strlcpy(msg.m_gid, gid, sizeof(msg.m_gid));
> +       msg.m_xact_bytes = xact_size;
> 
> +       key.subid = subid;
> +       memcpy(key.gid, gid, sizeof(key.gid));
> +       action = (create ? HASH_ENTER : HASH_FIND);
Fixed. I used strlcpy for new additional functions I made.
An exception is pgstat_read_db_statsfile().
In this function, we use memcpy() consistently in other places.

Please have a look at [1]

[1] -
https://www.postgresql.org/message-id/TYCPR01MB8373FEB287F733C81C1E4D42ED989%40TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
    Takamichi Osumi


RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Wednesday, November 10, 2021 6:13 PM I wrote:
> On Wednesday, November 10, 2021 3:43 PM Dilip Kumar
> <dilipbalaut@gmail.com> wrote:
> > On Tue, Nov 9, 2021 at 5:05 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > > On Tuesday, November 9, 2021 12:08 PM Greg Nancarrow
> > <gregn4422@gmail.com> wrote:
> > > > On Fri, Nov 5, 2021 at 7:11 PM osumi.takamichi@fujitsu.com
> > > > <osumi.takamichi@fujitsu.com> wrote:
> > > > >
> > > >
> > > > I did a quick scan through the latest v8 patch and noticed the
> > > > following
> > things:
> > > I appreciate your review !
> > I have reviewed some part of the patch and I have a few comments
> I really appreciate your attention and review.
...
> > 3.
> > +    {
> > +        size += *extra_data->stream_write_len;
> > +        add_apply_error_context_xact_size(size);
> > +        return;
> > +    }
> >
> > From apply_handle_insert(), we are calling update_apply_change_size(),
> > and inside this function we are dereferencing
> *extra_data->stream_write_len.
> > Basically, stream_write_len is in integer pointer and the caller
> > hasn't allocated memory for that and inside update_apply_change_size,
> > we are directly dereferencing the pointer, how this can be correct.
...
> I'll just delete the top part that handles streaming bytes calculation in the
> update_apply_change_size().
> It's because now that there is a specific structure to recognize each streaming
> xid and save transaction size there, which makes the top part in question
> useless.
> 
> > And I also see that in the
> > whole patch stream_write_len, is never used as lvalue so without
> > storing anything into this why are we trying to use this as rvalue
> > here?  This is clearly an issue.
> As described above, I'll fix this part and related codes mainly streaming related
> codes in the next version.
Removed this deadcodes.

Please have a look at [1]

[1] -
https://www.postgresql.org/message-id/TYCPR01MB8373FEB287F733C81C1E4D42ED989%40TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
    Takamichi Osumi


RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Monday, November 15, 2021 6:28 PM I wrote:
> On Thursday, November 11, 2021 9:47 PM vignesh C <vignesh21@gmail.com>
> wrote:
> > Few more comments:
...
> Fixed. Also, its name was too long
> when aligned with other PgStat_StatDBEntry memebers.
> Thus I renamed it as subworkers_preparedsizes.
> 
> This depends on v21 in [1]
Hi

There was some minor conflict error by the change of v22 in [1].
I've conducted some update for this.
(The rebased part is only C code and checked by pgindent)

[1] - https://www.postgresql.org/message-id/CAD21AoCE72vKXJp99f4xRw7Mh5ve-Z2roe21gP8Y82_CxXKvbg%40mail.gmail.com


Best Regards,
    Takamichi Osumi


Вложения

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Monday, November 15, 2021 9:14 PM I wrote:
> I've conducted some update for this.
> (The rebased part is only C code and checked by pgindent)
I'll update my patches since a new skip xid patch
has been shared in [1].

This version includes some minor renames of functions
that are related to transaction sizes.

[1] - https://www.postgresql.org/message-id/CAD21AoA5jupM6O%3DpYsyfaxQ1aMX-en8%3DQNgpW6KfXsg7_CS0CQ%40mail.gmail.com

Best Regards,
    Takamichi Osumi


Вложения

Re: Failed transaction statistics to measure the logical replication progress

От
Masahiko Sawada
Дата:
On Tue, Nov 16, 2021 at 9:34 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Monday, November 15, 2021 9:14 PM I wrote:
> > I've conducted some update for this.
> > (The rebased part is only C code and checked by pgindent)
> I'll update my patches since a new skip xid patch
> has been shared in [1].
>
> This version includes some minor renames of functions
> that are related to transaction sizes.

I've looked at v12-0001 patch. Here are some comments:

-   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "relid",
+   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "last_error_relid",
                       OIDOID, -1, 0);
-   TupleDescInitEntry(tupdesc, (AttrNumber) 4, "command",
+   TupleDescInitEntry(tupdesc, (AttrNumber) 4, "last_error_command",
                       TEXTOID, -1, 0);
-   TupleDescInitEntry(tupdesc, (AttrNumber) 5, "xid",
+   TupleDescInitEntry(tupdesc, (AttrNumber) 5, "last_error_xid",
                       XIDOID, -1, 0);
-   TupleDescInitEntry(tupdesc, (AttrNumber) 6, "error_count",
+   TupleDescInitEntry(tupdesc, (AttrNumber) 6, "last_error_count",
                       INT8OID, -1, 0);
-   TupleDescInitEntry(tupdesc, (AttrNumber) 7, "error_message",
+   TupleDescInitEntry(tupdesc, (AttrNumber) 7, "last_error_message",

If renaming column names clarifies those meanings, the above changes
should be included into my patch that introduces
pg_stat_subscription_workers view?

---
I think that exporting PartitionTupleRouting should not be done in the
one patch together with renaming the view columns. There is not
relevance between them at all. If it's used by v12-0002 patch, I think
it should be included in that patch or in another separate patch.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Wednesday, November 17, 2021 12:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Nov 16, 2021 at 9:34 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Monday, November 15, 2021 9:14 PM I wrote:
> > > I've conducted some update for this.
> > > (The rebased part is only C code and checked by pgindent)
> > I'll update my patches since a new skip xid patch has been shared in
> > [1].
> >
> > This version includes some minor renames of functions that are related
> > to transaction sizes.
> 
> I've looked at v12-0001 patch. Here are some comments:
Thank you for paying attention to this thread !


> -   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "relid",
> +   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "last_error_relid",
>                        OIDOID, -1, 0);
> -   TupleDescInitEntry(tupdesc, (AttrNumber) 4, "command",
> +   TupleDescInitEntry(tupdesc, (AttrNumber) 4, "last_error_command",
>                        TEXTOID, -1, 0);
> -   TupleDescInitEntry(tupdesc, (AttrNumber) 5, "xid",
> +   TupleDescInitEntry(tupdesc, (AttrNumber) 5, "last_error_xid",
>                        XIDOID, -1, 0);
> -   TupleDescInitEntry(tupdesc, (AttrNumber) 6, "error_count",
> +   TupleDescInitEntry(tupdesc, (AttrNumber) 6, "last_error_count",
>                        INT8OID, -1, 0);
> -   TupleDescInitEntry(tupdesc, (AttrNumber) 7, "error_message",
> +   TupleDescInitEntry(tupdesc, (AttrNumber) 7, "last_error_message",
> 
> If renaming column names clarifies those meanings, the above changes should
> be included into my patch that introduces pg_stat_subscription_workers
> view?
At first, your column names of pg_stat_subscription_workers look totally OK to me by itself
and I thought I should take care of those renaming at the commit timing of my stats patches.
But, if you agree with the new names above and fixing your patch doesn't
bother you, I'd appreciate your help !

> I think that exporting PartitionTupleRouting should not be done in the one
> patch together with renaming the view columns. There is not relevance
> between them at all. If it's used by v12-0002 patch, I think it should be included
> in that patch or in another separate patch.
Yes, it's used by v12-0002.

Absolutely you are right. When you update the patch like above,
I would like to make it independent.


Best Regards,
    Takamichi Osumi



Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Wed, Nov 17, 2021 at 9:44 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Wednesday, November 17, 2021 12:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Tue, Nov 16, 2021 at 9:34 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > On Monday, November 15, 2021 9:14 PM I wrote:
> > > > I've conducted some update for this.
> > > > (The rebased part is only C code and checked by pgindent)
> > > I'll update my patches since a new skip xid patch has been shared in
> > > [1].
> > >
> > > This version includes some minor renames of functions that are related
> > > to transaction sizes.
> >
> > I've looked at v12-0001 patch. Here are some comments:
> Thank you for paying attention to this thread !
>
>
> > -   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "relid",
> > +   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "last_error_relid",
> >                        OIDOID, -1, 0);
> > -   TupleDescInitEntry(tupdesc, (AttrNumber) 4, "command",
> > +   TupleDescInitEntry(tupdesc, (AttrNumber) 4, "last_error_command",
> >                        TEXTOID, -1, 0);
> > -   TupleDescInitEntry(tupdesc, (AttrNumber) 5, "xid",
> > +   TupleDescInitEntry(tupdesc, (AttrNumber) 5, "last_error_xid",
> >                        XIDOID, -1, 0);
> > -   TupleDescInitEntry(tupdesc, (AttrNumber) 6, "error_count",
> > +   TupleDescInitEntry(tupdesc, (AttrNumber) 6, "last_error_count",
> >                        INT8OID, -1, 0);
> > -   TupleDescInitEntry(tupdesc, (AttrNumber) 7, "error_message",
> > +   TupleDescInitEntry(tupdesc, (AttrNumber) 7, "last_error_message",
> >
> > If renaming column names clarifies those meanings, the above changes should
> > be included into my patch that introduces pg_stat_subscription_workers
> > view?

Right.

> At first, your column names of pg_stat_subscription_workers look totally OK to me by itself
> and I thought I should take care of those renaming at the commit timing of my stats patches.
>

Can you please tell us why you think the names in your proposed patch
are better than the existing names proposed in Sawada-San's patch? Is
it because those fields always contain the information of the last or
latest error that occurred in the corresponding subscription worker?
If so, I am not very sure if that is a good reason to increase the
length of most of the column names but if you and others feel that is
helpful then it is better to do it as part of Sawada-San's patch.

-- 
With Regards,
Amit Kapila.



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Wednesday, November 17, 2021 10:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Nov 17, 2021 at 9:44 AM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Wednesday, November 17, 2021 12:19 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > > On Tue, Nov 16, 2021 at 9:34 PM osumi.takamichi@fujitsu.com
> > > <osumi.takamichi@fujitsu.com> wrote:
> > > >
> > > > On Monday, November 15, 2021 9:14 PM I wrote:
> > > > > I've conducted some update for this.
> > > > > (The rebased part is only C code and checked by pgindent)
> > > > I'll update my patches since a new skip xid patch has been shared
> > > > in [1].
> > > >
> > > > This version includes some minor renames of functions that are
> > > > related to transaction sizes.
> > >
> > > I've looked at v12-0001 patch. Here are some comments:
> > Thank you for paying attention to this thread !
> >
> >
> > > -   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "relid",
> > > +   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "last_error_relid",
> > >                        OIDOID, -1, 0);
> > > -   TupleDescInitEntry(tupdesc, (AttrNumber) 4, "command",
> > > +   TupleDescInitEntry(tupdesc, (AttrNumber) 4,
> > > + "last_error_command",
> > >                        TEXTOID, -1, 0);
> > > -   TupleDescInitEntry(tupdesc, (AttrNumber) 5, "xid",
> > > +   TupleDescInitEntry(tupdesc, (AttrNumber) 5, "last_error_xid",
> > >                        XIDOID, -1, 0);
> > > -   TupleDescInitEntry(tupdesc, (AttrNumber) 6, "error_count",
> > > +   TupleDescInitEntry(tupdesc, (AttrNumber) 6, "last_error_count",
> > >                        INT8OID, -1, 0);
> > > -   TupleDescInitEntry(tupdesc, (AttrNumber) 7, "error_message",
> > > +   TupleDescInitEntry(tupdesc, (AttrNumber) 7,
> > > + "last_error_message",
> > >
> > > If renaming column names clarifies those meanings, the above changes
> > > should be included into my patch that introduces
> > > pg_stat_subscription_workers view?
> 
> Right.
> 
> > At first, your column names of pg_stat_subscription_workers look
> > totally OK to me by itself and I thought I should take care of those renaming at
> the commit timing of my stats patches.
> >
> 
> Can you please tell us why you think the names in your proposed patch are
> better than the existing names proposed in Sawada-San's patch? Is it because
> those fields always contain the information of the last or latest error that
> occurred in the corresponding subscription worker?
This is one reason. 

Another big reason comes from the final alignment when we list up all columns of both patches.
The patches in this thread is trying to introduce a column that indicates
cumulative count of error to show all error counts that the worker got in the past.
In this thread, after two or three improvements, this column name has reached to
a simple one 'error_count' (aligned with 'commit_count' and 'abort_count').
Then we need to differentiate what this thread's patch is trying to introduce and
what skip xid patch is introducing.

> If so, I am not very sure if that is a good reason to increase the length of most of
> the column names but if you and others feel that is helpful then it is better to do
> it as part of Sawada-San's patch.
Yes. On this point, it was a mistake that I handled that changes.
It'd be better that Sawada-san takes care of them once the names have been fixed.


Best Regards,
    Takamichi Osumi


Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Wed, Nov 17, 2021 at 7:12 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Wednesday, November 17, 2021 10:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Can you please tell us why you think the names in your proposed patch are
> > better than the existing names proposed in Sawada-San's patch? Is it because
> > those fields always contain the information of the last or latest error that
> > occurred in the corresponding subscription worker?
> This is one reason.
>
> Another big reason comes from the final alignment when we list up all columns of both patches.
> The patches in this thread is trying to introduce a column that indicates
> cumulative count of error to show all error counts that the worker got in the past.
>

Okay, I see your point and it makes sense to rename columns after
these other stats. I am not able to come up with any better names than
what is being used here. Sawada-San, do you agree with this, or do let
us know if you have any better ideas?

BTW, I think the way you are computing error_count in
pgstat_recv_subworker_error() doesn't seem correct to me because it
will accumulate the counter/bytes for the same error again and again.
You might want to update these counters after we have checked that the
received error is not the same as the previous one.

-- 
With Regards,
Amit Kapila.



Re: Failed transaction statistics to measure the logical replication progress

От
vignesh C
Дата:
On Tue, Nov 16, 2021 at 6:04 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Monday, November 15, 2021 9:14 PM I wrote:
> > I've conducted some update for this.
> > (The rebased part is only C code and checked by pgindent)
> I'll update my patches since a new skip xid patch
> has been shared in [1].
>
> This version includes some minor renames of functions
> that are related to transaction sizes.

Thanks for the updated patch, Few comments:
1) since pgstat_get_subworker_prepared_txn is called from only one
place and create is passed as true, we can remove create function
parameter or the function could be removed.
+ * Return subscription worker entry with the given subscription OID and
+ * gid.
+ * ----------
+ */
+static PgStat_SW_PreparedXactEntry *
+pgstat_get_subworker_prepared_txn(Oid databaseid, Oid subid,
+                                                                 char
*gid, bool create)
+{
+       PgStat_StatDBEntry *dbentry;
+       PgStat_SW_PreparedXactKey key;

 2) Include subworker prepared transactions also
  /*
* Don't create tables/functions/subworkers hashtables for
* uninteresting databases.
*/
if (onlydb != InvalidOid)
{
if (dbbuf.databaseid != onlydb &&
dbbuf.databaseid != InvalidOid)
break;
}

3) Similarly it should be mentioned in:
reset_dbentry_counters function header, pgstat_read_db_statsfile
function header and pgstat_get_db_entry function comments.

4) I felt we can remove "COMMIT of streaming transaction", since only
commit and commit prepared are the user operations. Shall we change it
to "COMMIT and COMMIT PREPARED will increment this counter."
+       <structfield>commit_count</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of transactions successfully applied in this subscription.
+       COMMIT, COMMIT of streaming transaction and COMMIT PREPARED increments
+       this counter.
+      </para></entry>
+     </row>

5) PgStat_SW_PreparedXactEntry should be before PgStat_SW_PreparedXactKey
 PgStat_StatSubWorkerEntry
 PgStat_StatSubWorkerKey
+PgStat_SW_PreparedXactKey
+PgStat_SW_PreparedXactEntry
 PgStat_StatTabEntry
 PgStat_SubXactStatus

6)  This change is not required
@@ -293,6 +306,7 @@ static inline void cleanup_subxact_info(void);
 static void stream_cleanup_files(Oid subid, TransactionId xid);
 static void stream_open_file(Oid subid, TransactionId xid, bool first);
 static void stream_write_change(char action, StringInfo s);
+
 static void stream_close_file(void);

Regards,
Vignesh



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Thursday, November 18, 2021 8:35 PM vignesh C <vignesh21@gmail.com> wrote:
> On Tue, Nov 16, 2021 at 6:04 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Monday, November 15, 2021 9:14 PM I wrote:
> > > I've conducted some update for this.
> > > (The rebased part is only C code and checked by pgindent)
> > I'll update my patches since a new skip xid patch has been shared in
> > [1].
> >
> > This version includes some minor renames of functions that are related
> > to transaction sizes.
> 
> Thanks for the updated patch, Few comments:
Thank you for checking the patches !


> 1) since pgstat_get_subworker_prepared_txn is called from only one place and
> create is passed as true, we can remove create function parameter or the
> function could be removed.
> + * Return subscription worker entry with the given subscription OID and
> + * gid.
> + * ----------
> + */
> +static PgStat_SW_PreparedXactEntry *
> +pgstat_get_subworker_prepared_txn(Oid databaseid, Oid subid,
> +                                                                 char
> *gid, bool create)
> +{
> +       PgStat_StatDBEntry *dbentry;
> +       PgStat_SW_PreparedXactKey key;
Removed the parameter.


>  2) Include subworker prepared transactions also
>   /*
> * Don't create tables/functions/subworkers hashtables for
> * uninteresting databases.
> */
> if (onlydb != InvalidOid)
> {
> if (dbbuf.databaseid != onlydb &&
> dbbuf.databaseid != InvalidOid)
> break;
> }
Fixed.


> 3) Similarly it should be mentioned in:
> reset_dbentry_counters function header, pgstat_read_db_statsfile function
> header and pgstat_get_db_entry function comments.
Fixed.

 
> 4) I felt we can remove "COMMIT of streaming transaction", since only commit
> and commit prepared are the user operations. Shall we change it to "COMMIT
> and COMMIT PREPARED will increment this counter."
> +       <structfield>commit_count</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Number of transactions successfully applied in this subscription.
> +       COMMIT, COMMIT of streaming transaction and COMMIT
> PREPARED increments
> +       this counter.
> +      </para></entry>
> +     </row>
You are right ! Fixed.

 
> 5) PgStat_SW_PreparedXactEntry should be before
> PgStat_SW_PreparedXactKey  PgStat_StatSubWorkerEntry
> PgStat_StatSubWorkerKey
> +PgStat_SW_PreparedXactKey
> +PgStat_SW_PreparedXactEntry
>  PgStat_StatTabEntry
>  PgStat_SubXactStatus
Fixed.

> 6)  This change is not required
> @@ -293,6 +306,7 @@ static inline void cleanup_subxact_info(void);  static
> void stream_cleanup_files(Oid subid, TransactionId xid);  static void
> stream_open_file(Oid subid, TransactionId xid, bool first);  static void
> stream_write_change(char action, StringInfo s);
> +
>  static void stream_close_file(void);
Removed.


Other changes are
1. refined the commit message of v13-0003*.
2. made the additional comment for ApplyErrorCallbackArg simple.
3. wrote more explanations about update_apply_change_size() as comment.
4. changed the behavior of pgstat_recv_subworker_error so that
   it can store stats info only once per error.
5. added one simple test for PREPARE and COMMIT PREPARED.

This used v23 skip xid patch [1].
(I will remove v13-0001* when the column names are fixed
and Sawada-san starts to take care of the column name definitions)

[1] - https://www.postgresql.org/message-id/CAD21AoA5jupM6O%3DpYsyfaxQ1aMX-en8%3DQNgpW6KfXsg7_CS0CQ%40mail.gmail.com

Best Regards,
    Takamichi Osumi


Вложения

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Thursday, November 18, 2021 12:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> BTW, I think the way you are computing error_count in
> pgstat_recv_subworker_error() doesn't seem correct to me because it will
> accumulate the counter/bytes for the same error again and again.
> You might want to update these counters after we have checked that the
> received error is not the same as the previous one.
Thank you for your comments !
This is addressed by v13 patchset [1]


[1] -
https://www.postgresql.org/message-id/TYCPR01MB8373533A5C24BDDA516DA7E1ED9B9%40TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
    Takamichi Osumi


Re: Failed transaction statistics to measure the logical replication progress

От
Masahiko Sawada
Дата:
On Thu, Nov 18, 2021 at 12:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Nov 17, 2021 at 7:12 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Wednesday, November 17, 2021 10:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Can you please tell us why you think the names in your proposed patch are
> > > better than the existing names proposed in Sawada-San's patch? Is it because
> > > those fields always contain the information of the last or latest error that
> > > occurred in the corresponding subscription worker?
> > This is one reason.
> >
> > Another big reason comes from the final alignment when we list up all columns of both patches.
> > The patches in this thread is trying to introduce a column that indicates
> > cumulative count of error to show all error counts that the worker got in the past.
> >
>
> Okay, I see your point and it makes sense to rename columns after
> these other stats. I am not able to come up with any better names than
> what is being used here. Sawada-San, do you agree with this, or do let
> us know if you have any better ideas?
>

I'm concerned that these new names will introduce confusion; if we
have last_error_relid, last_error_command, last_error_message,
last_error_time, and last_error_xid, I think users might think that
first_error_time is the timestamp at which an error occurred for the
first time in the subscription worker. Also, I'm not sure
last_error_count is not clear to me (it looks like showing something
count but the only "last" one?). An alternative idea would be to add
total_error_count by this patch, resulting in having both error_count
and total_error_count. Regarding commit_count and abort_count, I
personally think xact_commit and xact_rollback would be better since
they’re more consistent with pg_stat_database view, although we might
already have discussed that.

Besides that, I’m not sure how useful commit_bytes, abort_bytes, and
error_bytes are. I originally thought these statistics track the size
of received data, i.g., how much data is transferred from the
publisher and processed on the subscriber. But what the view currently
has is how much memory is used in the subscription worker. The
subscription worker emulates ReorderBufferChangeSize() on the
subscriber side but, as the comment of update_apply_change_size()
mentions, the size in the view is not accurate:

+ * The byte size of transaction on the publisher is calculated by
+ * ReorderBufferChangeSize() based on the ReorderBufferChange structure.
+ * But on the subscriber, consumed resources are not same as the
+ * publisher's decoding processsing and required to be computed in
+ * different way. Therefore, the exact same byte size is not restored on
+ * the subscriber usually.

Also, it seems to take into account the size of FlushPosition that is
not taken into account by ReorderBufferChangeSize().

I guess that the purpose of these values is to compare them to
total_bytes, stream_byte, and spill_bytes but if the calculation is
not accurate, does it mean that the more stats are updated, the more
the stats will be getting inaccurate?

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Fri, Nov 19, 2021 at 7:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Nov 18, 2021 at 12:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Nov 17, 2021 at 7:12 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > On Wednesday, November 17, 2021 10:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > Can you please tell us why you think the names in your proposed patch are
> > > > better than the existing names proposed in Sawada-San's patch? Is it because
> > > > those fields always contain the information of the last or latest error that
> > > > occurred in the corresponding subscription worker?
> > > This is one reason.
> > >
> > > Another big reason comes from the final alignment when we list up all columns of both patches.
> > > The patches in this thread is trying to introduce a column that indicates
> > > cumulative count of error to show all error counts that the worker got in the past.
> > >
> >
> > Okay, I see your point and it makes sense to rename columns after
> > these other stats. I am not able to come up with any better names than
> > what is being used here. Sawada-San, do you agree with this, or do let
> > us know if you have any better ideas?
> >
>
> I'm concerned that these new names will introduce confusion; if we
> have last_error_relid, last_error_command, last_error_message,
> last_error_time, and last_error_xid, I think users might think that
> first_error_time is the timestamp at which an error occurred for the
> first time in the subscription worker.
>

Isn't to some extent that confusion already exists because of
last_error_time column?

> Also, I'm not sure
> last_error_count is not clear to me (it looks like showing something
> count but the only "last" one?).
>

I feel if all the error related columns have "last_error_" as a prefix
then it should not be that confusing?

> An alternative idea would be to add
> total_error_count by this patch, resulting in having both error_count
> and total_error_count. Regarding commit_count and abort_count, I
> personally think xact_commit and xact_rollback would be better since
> they’re more consistent with pg_stat_database view, although we might
> already have discussed that.
>

Even if we decide to change the column names to
xact_commit/xact_rollback, I think with additional non-error columns
it will be clear to add 'error' in column names corresponding to error
columns, and last_error_* seems to be consistent with what we have in
pg_stat_archiver (last_failed_wal, last_failed_time). Your point
related to first_error_time has merit and I don't have a better answer
for it. I think it is just a convenience column and we are not sure
whether that will be required in practice so maybe we can drop that
column and come back to it later once we get some field feedback on
this view?

--
With Regards,
Amit Kapila.



Re: Failed transaction statistics to measure the logical replication progress

От
vignesh C
Дата:
On Fri, Nov 19, 2021 at 7:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Nov 18, 2021 at 12:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Nov 17, 2021 at 7:12 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > On Wednesday, November 17, 2021 10:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > Can you please tell us why you think the names in your proposed patch are
> > > > better than the existing names proposed in Sawada-San's patch? Is it because
> > > > those fields always contain the information of the last or latest error that
> > > > occurred in the corresponding subscription worker?
> > > This is one reason.
> > >
> > > Another big reason comes from the final alignment when we list up all columns of both patches.
> > > The patches in this thread is trying to introduce a column that indicates
> > > cumulative count of error to show all error counts that the worker got in the past.
> > >
> >
> > Okay, I see your point and it makes sense to rename columns after
> > these other stats. I am not able to come up with any better names than
> > what is being used here. Sawada-San, do you agree with this, or do let
> > us know if you have any better ideas?
> >
>
> I'm concerned that these new names will introduce confusion; if we
> have last_error_relid, last_error_command, last_error_message,
> last_error_time, and last_error_xid, I think users might think that
> first_error_time is the timestamp at which an error occurred for the
> first time in the subscription worker. Also, I'm not sure
> last_error_count is not clear to me (it looks like showing something
> count but the only "last" one?). An alternative idea would be to add
> total_error_count by this patch, resulting in having both error_count
> and total_error_count. Regarding commit_count and abort_count, I
> personally think xact_commit and xact_rollback would be better since
> they’re more consistent with pg_stat_database view, although we might
> already have discussed that.
>
> Besides that, I’m not sure how useful commit_bytes, abort_bytes, and
> error_bytes are. I originally thought these statistics track the size
> of received data, i.g., how much data is transferred from the
> publisher and processed on the subscriber. But what the view currently
> has is how much memory is used in the subscription worker. The
> subscription worker emulates ReorderBufferChangeSize() on the
> subscriber side but, as the comment of update_apply_change_size()
> mentions, the size in the view is not accurate:
>
> + * The byte size of transaction on the publisher is calculated by
> + * ReorderBufferChangeSize() based on the ReorderBufferChange structure.
> + * But on the subscriber, consumed resources are not same as the
> + * publisher's decoding processsing and required to be computed in
> + * different way. Therefore, the exact same byte size is not restored on
> + * the subscriber usually.
>
> Also, it seems to take into account the size of FlushPosition that is
> not taken into account by ReorderBufferChangeSize().

Let's keep the size calculation similar to the publisher side to avoid
any confusion, we can try to keep it the same as
ReorderBufferChangeSize wherever possible.

Regards,
Vignesh



Re: Failed transaction statistics to measure the logical replication progress

От
Greg Nancarrow
Дата:
On Sat, Nov 20, 2021 at 1:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I'm concerned that these new names will introduce confusion; if we
> have last_error_relid, last_error_command, last_error_message,
> last_error_time, and last_error_xid, I think users might think that
> first_error_time is the timestamp at which an error occurred for the
> first time in the subscription worker.

You mean you think users might think "first_error_time" is the
timestamp at which the last_error first occurred (rather than the
timestamp of the first of any type of error that occurred) on that
worker?

> ... Also, I'm not sure
> last_error_count is not clear to me (it looks like showing something
> count but the only "last" one?).

It's the number of times that the last_error has occurred.
Unless it's some kind of transient error, that might get resolved
without intervention, logical replication will get stuck in a loop
retrying and the last error will occur again and again, hence the
count of how many times that has happened.
Maybe there's not much benefit in counting different errors prior to
the last error?


Regards,
Greg Nancarrow
Fujitsu Australia



Re: Failed transaction statistics to measure the logical replication progress

От
Masahiko Sawada
Дата:
On Sat, Nov 20, 2021 at 3:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Nov 19, 2021 at 7:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Nov 18, 2021 at 12:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Nov 17, 2021 at 7:12 PM osumi.takamichi@fujitsu.com
> > > <osumi.takamichi@fujitsu.com> wrote:
> > > >
> > > > On Wednesday, November 17, 2021 10:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > Can you please tell us why you think the names in your proposed patch are
> > > > > better than the existing names proposed in Sawada-San's patch? Is it because
> > > > > those fields always contain the information of the last or latest error that
> > > > > occurred in the corresponding subscription worker?
> > > > This is one reason.
> > > >
> > > > Another big reason comes from the final alignment when we list up all columns of both patches.
> > > > The patches in this thread is trying to introduce a column that indicates
> > > > cumulative count of error to show all error counts that the worker got in the past.
> > > >
> > >
> > > Okay, I see your point and it makes sense to rename columns after
> > > these other stats. I am not able to come up with any better names than
> > > what is being used here. Sawada-San, do you agree with this, or do let
> > > us know if you have any better ideas?
> > >
> >
> > I'm concerned that these new names will introduce confusion; if we
> > have last_error_relid, last_error_command, last_error_message,
> > last_error_time, and last_error_xid, I think users might think that
> > first_error_time is the timestamp at which an error occurred for the
> > first time in the subscription worker.
> >
>
> Isn't to some extent that confusion already exists because of
> last_error_time column?
>
> > Also, I'm not sure
> > last_error_count is not clear to me (it looks like showing something
> > count but the only "last" one?).
> >
>
> I feel if all the error related columns have "last_error_" as a prefix
> then it should not be that confusing?
>
> > An alternative idea would be to add
> > total_error_count by this patch, resulting in having both error_count
> > and total_error_count. Regarding commit_count and abort_count, I
> > personally think xact_commit and xact_rollback would be better since
> > they’re more consistent with pg_stat_database view, although we might
> > already have discussed that.
> >
>
> Even if we decide to change the column names to
> xact_commit/xact_rollback, I think with additional non-error columns
> it will be clear to add 'error' in column names corresponding to error
> columns, and last_error_* seems to be consistent with what we have in
> pg_stat_archiver (last_failed_wal, last_failed_time).

Okay, I agree that last_error_* columns will be consistent.

> Your point
> related to first_error_time has merit and I don't have a better answer
> for it. I think it is just a convenience column and we are not sure
> whether that will be required in practice so maybe we can drop that
> column and come back to it later once we get some field feedback on
> this view?

+1

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Failed transaction statistics to measure the logical replication progress

От
Masahiko Sawada
Дата:
On Tue, Nov 23, 2021 at 3:21 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> On Sat, Nov 20, 2021 at 1:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I'm concerned that these new names will introduce confusion; if we
> > have last_error_relid, last_error_command, last_error_message,
> > last_error_time, and last_error_xid, I think users might think that
> > first_error_time is the timestamp at which an error occurred for the
> > first time in the subscription worker.
>
> You mean you think users might think "first_error_time" is the
> timestamp at which the last_error first occurred (rather than the
> timestamp of the first of any type of error that occurred) on that
> worker?

I felt that "first_error_time" is the timestamp of the first of any
type of error that occurred on the worker.

>
> > ... Also, I'm not sure
> > last_error_count is not clear to me (it looks like showing something
> > count but the only "last" one?).
>
> It's the number of times that the last_error has occurred.
> Unless it's some kind of transient error, that might get resolved
> without intervention, logical replication will get stuck in a loop
> retrying and the last error will occur again and again, hence the
> count of how many times that has happened.
> Maybe there's not much benefit in counting different errors prior to
> the last error?

The name "last_error_count" is somewhat clear to me now. I had felt
that since the last error refers to *one* error that occurred last and
it’s odd there is the count of it.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Friday, November 19, 2021 11:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Besides that, I’m not sure how useful commit_bytes, abort_bytes, and
> error_bytes are. I originally thought these statistics track the size of received
> data, i.g., how much data is transferred from the publisher and processed on
> the subscriber. But what the view currently has is how much memory is used in
> the subscription worker. The subscription worker emulates
> ReorderBufferChangeSize() on the subscriber side but, as the comment of
> update_apply_change_size() mentions, the size in the view is not accurate:
...
> I guess that the purpose of these values is to compare them to total_bytes,
> stream_byte, and spill_bytes but if the calculation is not accurate, does it mean
> that the more stats are updated, the more the stats will be getting inaccurate?
Thanks for your comment !

I tried to solve your concerns about byte columns but there are really difficult issues to solve.
For example, to begin with the messages of apply worker are different from those of
reorder buffer.
 
Therefore, I decided to split the previous patch and make counter columns go first.
v14 was checked by pgperltidy and pgindent.

This patch can be applied to the PG whose commit id is after 8d74fc9 (introduction of
pg_stat_subscription_workers).

Best Regards,
    Takamichi Osumi


Вложения

Re: Failed transaction statistics to measure the logical replication progress

От
vignesh C
Дата:
On Wed, Dec 1, 2021 at 3:04 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Friday, November 19, 2021 11:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Besides that, I’m not sure how useful commit_bytes, abort_bytes, and
> > error_bytes are. I originally thought these statistics track the size of received
> > data, i.g., how much data is transferred from the publisher and processed on
> > the subscriber. But what the view currently has is how much memory is used in
> > the subscription worker. The subscription worker emulates
> > ReorderBufferChangeSize() on the subscriber side but, as the comment of
> > update_apply_change_size() mentions, the size in the view is not accurate:
> ...
> > I guess that the purpose of these values is to compare them to total_bytes,
> > stream_byte, and spill_bytes but if the calculation is not accurate, does it mean
> > that the more stats are updated, the more the stats will be getting inaccurate?
> Thanks for your comment !
>
> I tried to solve your concerns about byte columns but there are really difficult issues to solve.
> For example, to begin with the messages of apply worker are different from those of
> reorder buffer.
>
> Therefore, I decided to split the previous patch and make counter columns go first.
> v14 was checked by pgperltidy and pgindent.
>
> This patch can be applied to the PG whose commit id is after 8d74fc9 (introduction of
> pg_stat_subscription_workers).

Thanks for the updated patch.
Currently we are storing the commit count, error_count and abort_count
for each table of the table sync operation. If we have thousands of
tables, we will be storing the information for each of the tables.
Shouldn't we be storing the consolidated information in this case.
diff --git a/src/backend/replication/logical/tablesync.c
b/src/backend/replication/logical/tablesync.c
index f07983a..02e9486 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1149,6 +1149,11 @@ copy_table_done:
        MyLogicalRepWorker->relstate_lsn = *origin_startpos;
        SpinLockRelease(&MyLogicalRepWorker->relmutex);

+       /* Report the success of table sync. */
+       pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
+
  MyLogicalRepWorker->relid,
+
  0 /* no logical message type */ );

postgres=# select * from pg_stat_subscription_workers ;
 subid | subname | subrelid | commit_count | error_count | abort_count
| last_error_relid | last_error_command | last_error_xid |
last_error_count | last_error_message | last_error_time

-------+---------+----------+--------------+-------------+-------------+------------------+--------------------+----------------+------------------+--------------------+-----------------
 16411 | sub1    |    16387 |            1 |           0 |           0
|                  |                    |                |
   0 |                    |
 16411 | sub1    |    16396 |            1 |           0 |           0
|                  |                    |                |
   0 |                    |
 16411 | sub1    |    16390 |            1 |           0 |           0
|                  |                    |                |
   0 |                    |
 16411 | sub1    |    16393 |            1 |           0 |           0
|                  |                    |                |
   0 |                    |
 16411 | sub1    |    16402 |            1 |           0 |           0
|                  |                    |                |
   0 |                    |
 16411 | sub1    |    16408 |            1 |           0 |           0
|                  |                    |                |
   0 |                    |
 16411 | sub1    |    16384 |            1 |           0 |           0
|                  |                    |                |
   0 |                    |
 16411 | sub1    |    16399 |            1 |           0 |           0
|                  |                    |                |
   0 |                    |
 16411 | sub1    |    16405 |            1 |           0 |           0
|                  |                    |                |
   0 |                    |
(9 rows)

Regards,
Vignesh



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Friday, December 3, 2021 3:12 PM vignesh C <vignesh21@gmail.com> wrote:
> Thanks for the updated patch.
> Currently we are storing the commit count, error_count and abort_count for
> each table of the table sync operation. If we have thousands of tables, we will
> be storing the information for each of the tables.
> Shouldn't we be storing the consolidated information in this case.
> diff --git a/src/backend/replication/logical/tablesync.c
> b/src/backend/replication/logical/tablesync.c
> index f07983a..02e9486 100644
> --- a/src/backend/replication/logical/tablesync.c
> +++ b/src/backend/replication/logical/tablesync.c
> @@ -1149,6 +1149,11 @@ copy_table_done:
>         MyLogicalRepWorker->relstate_lsn = *origin_startpos;
>         SpinLockRelease(&MyLogicalRepWorker->relmutex);
> 
> +       /* Report the success of table sync. */
> +       pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
> +
>   MyLogicalRepWorker->relid,
> +
>   0 /* no logical message type */ );
Okay.

I united all stats into that of apply worker.
In line with this change, I fixed the TAP tests as well
to cover the updates of stats done by table sync workers.

Also, during my self-review, I noticed that
I should call pgstat_report_subworker_xact_end() before
process_syncing_tables() because it can lead to process
exit, which results in missing one increment of the stats columns.
I noted this point in a comment as well.

Best Regards,
    Takamichi Osumi


Вложения

Re: Failed transaction statistics to measure the logical replication progress

От
vignesh C
Дата:
On Sat, Dec 4, 2021 at 6:32 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Friday, December 3, 2021 3:12 PM vignesh C <vignesh21@gmail.com> wrote:
> > Thanks for the updated patch.
> > Currently we are storing the commit count, error_count and abort_count for
> > each table of the table sync operation. If we have thousands of tables, we will
> > be storing the information for each of the tables.
> > Shouldn't we be storing the consolidated information in this case.
> > diff --git a/src/backend/replication/logical/tablesync.c
> > b/src/backend/replication/logical/tablesync.c
> > index f07983a..02e9486 100644
> > --- a/src/backend/replication/logical/tablesync.c
> > +++ b/src/backend/replication/logical/tablesync.c
> > @@ -1149,6 +1149,11 @@ copy_table_done:
> >         MyLogicalRepWorker->relstate_lsn = *origin_startpos;
> >         SpinLockRelease(&MyLogicalRepWorker->relmutex);
> >
> > +       /* Report the success of table sync. */
> > +       pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
> > +
> >   MyLogicalRepWorker->relid,
> > +
> >   0 /* no logical message type */ );
> Okay.
>
> I united all stats into that of apply worker.
> In line with this change, I fixed the TAP tests as well
> to cover the updates of stats done by table sync workers.
>
> Also, during my self-review, I noticed that
> I should call pgstat_report_subworker_xact_end() before
> process_syncing_tables() because it can lead to process
> exit, which results in missing one increment of the stats columns.
> I noted this point in a comment as well.

Thanks for the updated patch, few comments:
1) We can keep the documentation similar to mention the count includes
both table sync worker / main apply worker in case of
commit_count/error_count and abort_count to keep it consistent.
+       <structfield>commit_count</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of transactions successfully applied in this subscription.
+       COMMIT and COMMIT PREPARED increments this counter.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>error_count</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of transactions that failed to be applied by the table
+       sync worker or main apply worker in this subscription.
+       </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>abort_count</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of transactions aborted in this subscription.
+       ROLLBACK PREPARED increments this counter.
+      </para></entry>
+     </row>

2) Can this be changed:
+       /*
+        * If this is a new error reported by table sync worker,
consolidate this
+        * error count into the entry of apply worker.
+        */
+       if (OidIsValid(msg->m_subrelid))
+       {
+               /* Gain the apply worker stats */
+               subwentry = pgstat_get_subworker_entry(dbentry, msg->m_subid,
+
                    InvalidOid, true);
+               subwentry->error_count++;
+       }
+       else
+               subwentry->error_count++;       /* increment the apply
worker's counter. */
To:
+       /*
+        * If this is a new error reported by table sync worker,
consolidate this
+        * error count into the entry of apply worker.
+        */
+       if (OidIsValid(msg->m_subrelid))
+               /* Gain the apply worker stats */
+               subwentry = pgstat_get_subworker_entry(dbentry, msg->m_subid,
+
                    InvalidOid, true);
+
+       subwentry->error_count++;       /* increment the apply
worker's counter. */

3) Since both 026_worker_stats and 027_worker_xact_stats.pl are
testing pg_stat_subscription_workers, can we move the tests to
026_worker_stats.pl. If possible the error_count validation can be
combined with the existing tests.
diff --git a/src/test/subscription/t/027_worker_xact_stats.pl
b/src/test/subscription/t/027_worker_xact_stats.pl
new file mode 100644
index 0000000..31dbea1
--- /dev/null
+++ b/src/test/subscription/t/027_worker_xact_stats.pl
@@ -0,0 +1,162 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Tests for subscription worker statistics during apply.
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More tests => 1;
+
+# Create publisher node

Regards,
Vignesh



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Monday, December 6, 2021 11:27 PM vignesh C <vignesh21@gmail.com> wrote:
> Thanks for the updated patch, few comments:
Thank you for your review !

> 1) We can keep the documentation similar to mention the count includes both
> table sync worker / main apply worker in case of commit_count/error_count
> and abort_count to keep it consistent.
> +       <structfield>commit_count</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Number of transactions successfully applied in this subscription.
> +       COMMIT and COMMIT PREPARED increments this counter.
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>error_count</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Number of transactions that failed to be applied by the table
> +       sync worker or main apply worker in this subscription.
> +       </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>abort_count</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Number of transactions aborted in this subscription.
> +       ROLLBACK PREPARED increments this counter.
> +      </para></entry>
> +     </row>
Yeah, you are right. Fixed.
Note that abort_count is not used by table sync worker.

 
> 2) Can this be changed:
> +       /*
> +        * If this is a new error reported by table sync worker,
> consolidate this
> +        * error count into the entry of apply worker.
> +        */
> +       if (OidIsValid(msg->m_subrelid))
> +       {
> +               /* Gain the apply worker stats */
> +               subwentry = pgstat_get_subworker_entry(dbentry,
> + msg->m_subid,
> +
>                     InvalidOid, true);
> +               subwentry->error_count++;
> +       }
> +       else
> +               subwentry->error_count++;       /* increment the apply
> worker's counter. */
> To:
> +       /*
> +        * If this is a new error reported by table sync worker,
> consolidate this
> +        * error count into the entry of apply worker.
> +        */
> +       if (OidIsValid(msg->m_subrelid))
> +               /* Gain the apply worker stats */
> +               subwentry = pgstat_get_subworker_entry(dbentry,
> + msg->m_subid,
> +
>                     InvalidOid, true);
> +
> +       subwentry->error_count++;       /* increment the apply
> worker's counter. */
Your suggestion looks better.
Also, I fixed some comments of this part
so that we don't need to add a separate comment at the bottom
for the increment of the apply worker.

 
> 3) Since both 026_worker_stats and 027_worker_xact_stats.pl are testing
> pg_stat_subscription_workers, can we move the tests to 026_worker_stats.pl.
> If possible the error_count validation can be combined with the existing tests.
> diff --git a/src/test/subscription/t/027_worker_xact_stats.pl
> b/src/test/subscription/t/027_worker_xact_stats.pl
> new file mode 100644
> index 0000000..31dbea1
> --- /dev/null
> +++ b/src/test/subscription/t/027_worker_xact_stats.pl
> @@ -0,0 +1,162 @@
> +
> +# Copyright (c) 2021, PostgreSQL Global Development Group
> +
> +# Tests for subscription worker statistics during apply.
> +use strict;
> +use warnings;
> +use PostgreSQL::Test::Cluster;
> +use PostgreSQL::Test::Utils;
> +use Test::More tests => 1;
> +
> +# Create publisher node
Right. I've integrated my tests with 026_worker_stats.pl.
I think error_count validations are combined as you suggested.
Another change I did is to introduce one function
to contribute to better readability of the stats tests.

Here, the 026_worker_stats.pl didn't look aligned by
pgperltidy. This is not a serious issue at all.
Yet, when I ran pgperltidy, the existing codes
that required adjustments came into my patch.
Therefore, I made a separate part for this.

Best Regards,
    Takamichi Osumi


Вложения

Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Tue, Dec 7, 2021 at 3:12 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>

Few questions and comments:
========================
1.
The <structname>pg_stat_subscription_workers</structname> view will contain
    one row per subscription worker on which errors have occurred, for workers
    applying logical replication changes and workers handling the initial data
-   copy of the subscribed tables.  The statistics entry is removed when the
-   corresponding subscription is dropped.
+   copy of the subscribed tables. Also, the row corresponding to the apply
+   worker shows all transaction statistics of both types of workers on the
+   subscription. The statistics entry is removed when the corresponding
+   subscription is dropped.

Why did you choose to show stats for both types of workers in one row?

2.
+ PGSTAT_MTYPE_SUBWORKERXACTEND,
 } StatMsgType;

I don't think we comma with the last message type.

3.
+ Oid m_subrelid;
+
+ /* necessary to determine column to increment */
+ LogicalRepMsgType m_command;
+
+} PgStat_MsgSubWorkerXactEnd;

Is m_subrelid used in this patch? If not, why did you keep it? I think
if you choose to show separate stats for table sync and apply worker
then probably it will be used.

4.
  /*
+ * Cumulative transaction statistics of subscription worker
+ */
+ PgStat_Counter commit_count;
+ PgStat_Counter error_count;
+ PgStat_Counter abort_count;
+

I think it is better to keep the order of columns as commit_count,
abort_count, error_count in the entire patch.

-- 
With Regards,
Amit Kapila.



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Monday, December 13, 2021 6:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Dec 7, 2021 at 3:12 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> 
> Few questions and comments:
Thank you for your comments !

> ========================
> 1.
> The <structname>pg_stat_subscription_workers</structname> view will
> contain
>     one row per subscription worker on which errors have occurred, for workers
>     applying logical replication changes and workers handling the initial data
> -   copy of the subscribed tables.  The statistics entry is removed when the
> -   corresponding subscription is dropped.
> +   copy of the subscribed tables. Also, the row corresponding to the apply
> +   worker shows all transaction statistics of both types of workers on the
> +   subscription. The statistics entry is removed when the corresponding
> +   subscription is dropped.
> 
> Why did you choose to show stats for both types of workers in one row?
This is because if we have hundreds or thousands of tables for table sync,
we need to create many entries to cover them and store the entries for all tables.


> 2.
> + PGSTAT_MTYPE_SUBWORKERXACTEND,
>  } StatMsgType;
> 
> I don't think we comma with the last message type.
> 4.
>   /*
> + * Cumulative transaction statistics of subscription worker */
> + PgStat_Counter commit_count; PgStat_Counter error_count;
> + PgStat_Counter abort_count;
> +
> 
> I think it is better to keep the order of columns as commit_count, abort_count,
> error_count in the entire patch.
Okay, I'll fix both points in the next version.

 
> 3.
> + Oid m_subrelid;
> +
> + /* necessary to determine column to increment */ LogicalRepMsgType
> + m_command;
> +
> +} PgStat_MsgSubWorkerXactEnd;
> 
> Is m_subrelid used in this patch? If not, why did you keep it?
Absolutely, this was a mistake when I took the decision to merge both stats
of table sync and apply worker.

> I think if you choose
> to show separate stats for table sync and apply worker then probably it will be
> used.
Yeah, I'll fix this. Of course, after I could confirm that the idea for merging the
two types of workers stats was acceptable for you and others.

Best Regards,
    Takamichi Osumi



Re: Failed transaction statistics to measure the logical replication progress

От
vignesh C
Дата:
On Tue, Dec 7, 2021 at 3:12 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Monday, December 6, 2021 11:27 PM vignesh C <vignesh21@gmail.com> wrote:
> > Thanks for the updated patch, few comments:
> Thank you for your review !
>
> > 1) We can keep the documentation similar to mention the count includes both
> > table sync worker / main apply worker in case of commit_count/error_count
> > and abort_count to keep it consistent.
> > +       <structfield>commit_count</structfield> <type>bigint</type>
> > +      </para>
> > +      <para>
> > +       Number of transactions successfully applied in this subscription.
> > +       COMMIT and COMMIT PREPARED increments this counter.
> > +      </para></entry>
> > +     </row>
> > +
> > +     <row>
> > +      <entry role="catalog_table_entry"><para role="column_definition">
> > +       <structfield>error_count</structfield> <type>bigint</type>
> > +      </para>
> > +      <para>
> > +       Number of transactions that failed to be applied by the table
> > +       sync worker or main apply worker in this subscription.
> > +       </para></entry>
> > +     </row>
> > +
> > +     <row>
> > +      <entry role="catalog_table_entry"><para role="column_definition">
> > +       <structfield>abort_count</structfield> <type>bigint</type>
> > +      </para>
> > +      <para>
> > +       Number of transactions aborted in this subscription.
> > +       ROLLBACK PREPARED increments this counter.
> > +      </para></entry>
> > +     </row>
> Yeah, you are right. Fixed.
> Note that abort_count is not used by table sync worker.
>
>
> > 2) Can this be changed:
> > +       /*
> > +        * If this is a new error reported by table sync worker,
> > consolidate this
> > +        * error count into the entry of apply worker.
> > +        */
> > +       if (OidIsValid(msg->m_subrelid))
> > +       {
> > +               /* Gain the apply worker stats */
> > +               subwentry = pgstat_get_subworker_entry(dbentry,
> > + msg->m_subid,
> > +
> >                     InvalidOid, true);
> > +               subwentry->error_count++;
> > +       }
> > +       else
> > +               subwentry->error_count++;       /* increment the apply
> > worker's counter. */
> > To:
> > +       /*
> > +        * If this is a new error reported by table sync worker,
> > consolidate this
> > +        * error count into the entry of apply worker.
> > +        */
> > +       if (OidIsValid(msg->m_subrelid))
> > +               /* Gain the apply worker stats */
> > +               subwentry = pgstat_get_subworker_entry(dbentry,
> > + msg->m_subid,
> > +
> >                     InvalidOid, true);
> > +
> > +       subwentry->error_count++;       /* increment the apply
> > worker's counter. */
> Your suggestion looks better.
> Also, I fixed some comments of this part
> so that we don't need to add a separate comment at the bottom
> for the increment of the apply worker.
>
>
> > 3) Since both 026_worker_stats and 027_worker_xact_stats.pl are testing
> > pg_stat_subscription_workers, can we move the tests to 026_worker_stats.pl.
> > If possible the error_count validation can be combined with the existing tests.
> > diff --git a/src/test/subscription/t/027_worker_xact_stats.pl
> > b/src/test/subscription/t/027_worker_xact_stats.pl
> > new file mode 100644
> > index 0000000..31dbea1
> > --- /dev/null
> > +++ b/src/test/subscription/t/027_worker_xact_stats.pl
> > @@ -0,0 +1,162 @@
> > +
> > +# Copyright (c) 2021, PostgreSQL Global Development Group
> > +
> > +# Tests for subscription worker statistics during apply.
> > +use strict;
> > +use warnings;
> > +use PostgreSQL::Test::Cluster;
> > +use PostgreSQL::Test::Utils;
> > +use Test::More tests => 1;
> > +
> > +# Create publisher node
> Right. I've integrated my tests with 026_worker_stats.pl.
> I think error_count validations are combined as you suggested.
> Another change I did is to introduce one function
> to contribute to better readability of the stats tests.
>
> Here, the 026_worker_stats.pl didn't look aligned by
> pgperltidy. This is not a serious issue at all.
> Yet, when I ran pgperltidy, the existing codes
> that required adjustments came into my patch.
> Therefore, I made a separate part for this.

Thanks for the updated patch, few comments:
1) Can we change this:
        /*
+        * Report the success of table sync as one commit to consolidate all
+        * transaction stats into one record.
+        */
+       pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
+
  LOGICAL_REP_MSG_COMMIT);
+
To:
        /* Report the success of table sync */
+       pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
+
  LOGICAL_REP_MSG_COMMIT);
+

2) Typo: ealier should be earlier
+       /*
+        * Report ealier than the call of process_syncing_tables() not
to miss an
+        * increment of commit_count in case it leads to the process exit. See
+        * process_syncing_tables_for_apply().
+        */
+       pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
+
  LOGICAL_REP_MSG_COMMIT);
+

3) Should we add an Assert for subwentry:
+       /*
+        * If this is a new error reported by table sync worker,
consolidate this
+        * error count into the entry of apply worker, by swapping the stats
+        * entries.
+        */
+       if (OidIsValid(msg->m_subrelid))
+               subwentry = pgstat_get_subworker_entry(dbentry, msg->m_subid,
+
                    InvalidOid, true);
+       subwentry->error_count++;

4) Can we slightly change it to :We can change it:
+# Check the update of stats counters.
+confirm_transaction_stats_update(
+       $node_subscriber,
+       'commit_count = 1',
+       'the commit_count increment by table sync');
+
+confirm_transaction_stats_update(
+       $node_subscriber,
+       'error_count = 1',
+       'the error_count increment by table sync');
to:
+# Check updation of subscription worker transaction count statistics.
+confirm_transaction_stats_update(
+       $node_subscriber,
+       'commit_count = 1',
+       'check table sync worker commit count is updated');
+
+confirm_transaction_stats_update(
+       $node_subscriber,
+       'error_count = 1',
+       'check table sync worker error count is updated');

Regards,
Vignesh



Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Mon, Dec 13, 2021 at 5:48 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Monday, December 13, 2021 6:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Dec 7, 2021 at 3:12 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> >
> > Few questions and comments:
> Thank you for your comments !
>
> > ========================
> > 1.
> > The <structname>pg_stat_subscription_workers</structname> view will
> > contain
> >     one row per subscription worker on which errors have occurred, for workers
> >     applying logical replication changes and workers handling the initial data
> > -   copy of the subscribed tables.  The statistics entry is removed when the
> > -   corresponding subscription is dropped.
> > +   copy of the subscribed tables. Also, the row corresponding to the apply
> > +   worker shows all transaction statistics of both types of workers on the
> > +   subscription. The statistics entry is removed when the corresponding
> > +   subscription is dropped.
> >
> > Why did you choose to show stats for both types of workers in one row?
> This is because if we have hundreds or thousands of tables for table sync,
> we need to create many entries to cover them and store the entries for all tables.
>

If we fear a large number of entries for such workers then won't it be
better to show the value of these stats only for apply workers. I
think normally the table sync workers perform only copy operation or
maybe a fixed number of xacts, so, one might not be interested in the
transaction stats of these workers. I find merging only specific stats
of two different types of workers confusing.

What do others think about this?

-- 
With Regards,
Amit Kapila.



RE: Failed transaction statistics to measure the logical replication progress

От
"houzj.fnst@fujitsu.com"
Дата:
On Tues, Dec 14, 2021 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Dec 13, 2021 at 5:48 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Monday, December 13, 2021 6:19 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > > On Tue, Dec 7, 2021 at 3:12 PM osumi.takamichi@fujitsu.com
> > > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > Few questions and comments:
> > Thank you for your comments !
> >
> > > ========================
> > > 1.
> > >     one row per subscription worker on which errors have occurred, for workers
> > >     applying logical replication changes and workers handling the initial data
> > > -   copy of the subscribed tables.  The statistics entry is removed when the
> > > -   corresponding subscription is dropped.
> > > +   copy of the subscribed tables. Also, the row corresponding to the apply
> > > +   worker shows all transaction statistics of both types of workers on the
> > > +   subscription. The statistics entry is removed when the corresponding
> > > +   subscription is dropped.
> > >
> > > Why did you choose to show stats for both types of workers in one row?
> > This is because if we have hundreds or thousands of tables for table sync,
> > we need to create many entries to cover them and store the entries for all
> > tables.
> >
> 
> If we fear a large number of entries for such workers then won't it be
> better to show the value of these stats for apply workers. I
> think normally the table sync workers perform only copy operation or
> maybe a fixed number of xacts, so, one might not be interested in the
> transaction stats of these workers. I find merging only specific stats
> of two different types of workers confusing.
> 
> What do others think about this?

Personally, I agreed that merging two types of stats into one row might not be
a good idea. And the xact stats of table sync workers are usually less
interesting than the apply worker's, So, it's seems acceptable to me if we
show stats only for apply workers and document about this.

Best regards,
Hou zj

Re: Failed transaction statistics to measure the logical replication progress

От
Masahiko Sawada
Дата:
On Tue, Dec 14, 2021 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Dec 13, 2021 at 5:48 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Monday, December 13, 2021 6:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > On Tue, Dec 7, 2021 at 3:12 PM osumi.takamichi@fujitsu.com
> > > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > Few questions and comments:
> > Thank you for your comments !
> >
> > > ========================
> > > 1.
> > > The <structname>pg_stat_subscription_workers</structname> view will
> > > contain
> > >     one row per subscription worker on which errors have occurred, for workers
> > >     applying logical replication changes and workers handling the initial data
> > > -   copy of the subscribed tables.  The statistics entry is removed when the
> > > -   corresponding subscription is dropped.
> > > +   copy of the subscribed tables. Also, the row corresponding to the apply
> > > +   worker shows all transaction statistics of both types of workers on the
> > > +   subscription. The statistics entry is removed when the corresponding
> > > +   subscription is dropped.
> > >
> > > Why did you choose to show stats for both types of workers in one row?
> > This is because if we have hundreds or thousands of tables for table sync,
> > we need to create many entries to cover them and store the entries for all tables.
> >
>
> If we fear a large number of entries for such workers then won't it be
> better to show the value of these stats only for apply workers. I
> think normally the table sync workers perform only copy operation or
> maybe a fixed number of xacts, so, one might not be interested in the
> transaction stats of these workers. I find merging only specific stats
> of two different types of workers confusing.
>
> What do others think about this?

I understand the concern to have a large number of entries but I agree
that merging only specific stats would confuse users. As Amit
suggested, it'd be better to show only apply workers' transaction
stats.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Failed transaction statistics to measure the logical replication progress

От
vignesh C
Дата:
On Tue, Dec 14, 2021 at 7:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Dec 13, 2021 at 5:48 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Monday, December 13, 2021 6:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > On Tue, Dec 7, 2021 at 3:12 PM osumi.takamichi@fujitsu.com
> > > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > Few questions and comments:
> > Thank you for your comments !
> >
> > > ========================
> > > 1.
> > > The <structname>pg_stat_subscription_workers</structname> view will
> > > contain
> > >     one row per subscription worker on which errors have occurred, for workers
> > >     applying logical replication changes and workers handling the initial data
> > > -   copy of the subscribed tables.  The statistics entry is removed when the
> > > -   corresponding subscription is dropped.
> > > +   copy of the subscribed tables. Also, the row corresponding to the apply
> > > +   worker shows all transaction statistics of both types of workers on the
> > > +   subscription. The statistics entry is removed when the corresponding
> > > +   subscription is dropped.
> > >
> > > Why did you choose to show stats for both types of workers in one row?
> > This is because if we have hundreds or thousands of tables for table sync,
> > we need to create many entries to cover them and store the entries for all tables.
> >
>
> If we fear a large number of entries for such workers then won't it be
> better to show the value of these stats only for apply workers. I
> think normally the table sync workers perform only copy operation or
> maybe a fixed number of xacts, so, one might not be interested in the
> transaction stats of these workers. I find merging only specific stats
> of two different types of workers confusing.
>
> What do others think about this?

We can remove the table sync workers transaction stats count to avoid
confusion, take care of the documentation changes too accordingly.

Regards,
Vignesh



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Wednesday, December 15, 2021 9:52 PM vignesh C <vignesh21@gmail.com> wrote:
> On Tue, Dec 14, 2021 at 7:58 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Mon, Dec 13, 2021 at 5:48 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > On Monday, December 13, 2021 6:19 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > > > On Tue, Dec 7, 2021 at 3:12 PM osumi.takamichi@fujitsu.com
> > > > <osumi.takamichi@fujitsu.com> wrote:
> > > >
> > > > Few questions and comments:
> > > Thank you for your comments !
> > >
> > > > ========================
> > > > 1.
> > > > The <structname>pg_stat_subscription_workers</structname> view
> > > > will contain
> > > >     one row per subscription worker on which errors have occurred, for
> workers
> > > >     applying logical replication changes and workers handling the initial
> data
> > > > -   copy of the subscribed tables.  The statistics entry is removed
> when the
> > > > -   corresponding subscription is dropped.
> > > > +   copy of the subscribed tables. Also, the row corresponding to the
> apply
> > > > +   worker shows all transaction statistics of both types of workers on
> the
> > > > +   subscription. The statistics entry is removed when the
> corresponding
> > > > +   subscription is dropped.
> > > >
> > > > Why did you choose to show stats for both types of workers in one row?
> > > This is because if we have hundreds or thousands of tables for table
> > > sync, we need to create many entries to cover them and store the entries for
> all tables.
> > >
> >
> > If we fear a large number of entries for such workers then won't it be
> > better to show the value of these stats only for apply workers. I
> > think normally the table sync workers perform only copy operation or
> > maybe a fixed number of xacts, so, one might not be interested in the
> > transaction stats of these workers. I find merging only specific stats
> > of two different types of workers confusing.
> >
> > What do others think about this?
> 
> We can remove the table sync workers transaction stats count to avoid
> confusion, take care of the documentation changes too accordingly.
Hi, apologies for my late reply.

Thank you, everyone for confirming the direction.
I'll follow the consensus of the community
and fix the patch, including other comments.
I'll treat only the stats for apply workers.


Best Regards,
    Takamichi Osumi



Re: Failed transaction statistics to measure the logical replication progress

От
Greg Nancarrow
Дата:

On Tue, Dec 14, 2021 at 1:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> If we fear a large number of entries for such workers then won't it be
> better to show the value of these stats only for apply workers. I
> think normally the table sync workers perform only copy operation or
> maybe a fixed number of xacts, so, one might not be interested in the
> transaction stats of these workers. I find merging only specific stats
> of two different types of workers confusing.
>
> What do others think about this?
>

I think it might be OK to NOT include the transaction stats of the tablesync workers, but my understanding (and slight concern) is that currently there is potentially some overlap in the work done by the tablesync and apply workers - perhaps the small patch (see [1]) proposed by Peter Smith could also be considered, in order to make that distinction of work clearer, and the stats more meaningful?

----


Regards,
Greg Nancarrow
Fujitsu Australia

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Thursday, December 16, 2021 4:00 PM I wrote:
> Thank you, everyone for confirming the direction.
> I'll follow the consensus of the community and fix the patch, including other
> comments.
> I'll treat only the stats for apply workers.
Hi, created a new version v17 according to the recent discussion
with changes to address other review comments.

Kindly have a look at it.

Best Regards,
    Takamichi Osumi


Вложения

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Monday, December 13, 2021 6:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Dec 7, 2021 at 3:12 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> 
> Few questions and comments:
> ========================
> 1.
> The <structname>pg_stat_subscription_workers</structname> view will
> contain
>     one row per subscription worker on which errors have occurred, for workers
>     applying logical replication changes and workers handling the initial data
> -   copy of the subscribed tables.  The statistics entry is removed when the
> -   corresponding subscription is dropped.
> +   copy of the subscribed tables. Also, the row corresponding to the apply
> +   worker shows all transaction statistics of both types of workers on the
> +   subscription. The statistics entry is removed when the corresponding
> +   subscription is dropped.
> 
> Why did you choose to show stats for both types of workers in one row?
Now, the added stats show only the statistics of apply worker
as we agreed.


> 2.
> + PGSTAT_MTYPE_SUBWORKERXACTEND,
>  } StatMsgType;
> 
> I don't think we comma with the last message type.
Fixed.

 
> 3.
> + Oid m_subrelid;
> +
> + /* necessary to determine column to increment */ LogicalRepMsgType
> + m_command;
> +
> +} PgStat_MsgSubWorkerXactEnd;
> 
> Is m_subrelid used in this patch? If not, why did you keep it? I think if you
> choose to show separate stats for table sync and apply worker then probably it
> will be used.
Removed.


> 4.
>   /*
> + * Cumulative transaction statistics of subscription worker */
> + PgStat_Counter commit_count; PgStat_Counter error_count;
> + PgStat_Counter abort_count;
> +
> 
> I think it is better to keep the order of columns as commit_count, abort_count,
> error_count in the entire patch.
Fixed.


The new patch is shared in [1].

[1] -
https://www.postgresql.org/message-id/TYCPR01MB83734A7A0596AC7ADB0DCB51ED779%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
    Takamichi Osumi


RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Tuesday, December 14, 2021 12:45 AM vignesh C <vignesh21@gmail.com> wrote:
> Thanks for the updated patch, few comments:
> 1) Can we change this:
>         /*
> +        * Report the success of table sync as one commit to consolidate all
> +        * transaction stats into one record.
> +        */
> +       pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
> +
>   LOGICAL_REP_MSG_COMMIT);
> +
> To:
>         /* Report the success of table sync */
> +       pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
> +
>   LOGICAL_REP_MSG_COMMIT);
> +
This function call that the table sync worker reports
an update of stats has been removed according to the recent discussion.


> 2) Typo: ealier should be earlier
> +       /*
> +        * Report ealier than the call of process_syncing_tables() not
> to miss an
> +        * increment of commit_count in case it leads to the process exit. See
> +        * process_syncing_tables_for_apply().
> +        */
> +       pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
> +
>   LOGICAL_REP_MSG_COMMIT);
> +
Thanks ! Fixed.

 
> 3) Should we add an Assert for subwentry:
> +       /*
> +        * If this is a new error reported by table sync worker,
> consolidate this
> +        * error count into the entry of apply worker, by swapping the stats
> +        * entries.
> +        */
> +       if (OidIsValid(msg->m_subrelid))
> +               subwentry = pgstat_get_subworker_entry(dbentry,
> + msg->m_subid,
> +
>                     InvalidOid, true);
> +       subwentry->error_count++;
The latest implementation doesn't require
the call of pgstat_get_subworker_entry().
So, I skipped.

> 4) Can we slightly change it to :We can change it:
> +# Check the update of stats counters.
> +confirm_transaction_stats_update(
> +       $node_subscriber,
> +       'commit_count = 1',
> +       'the commit_count increment by table sync');
> +
> +confirm_transaction_stats_update(
> +       $node_subscriber,
> +       'error_count = 1',
> +       'the error_count increment by table sync');
> to:
> +# Check updation of subscription worker transaction count statistics.
> +confirm_transaction_stats_update(
> +       $node_subscriber,
> +       'commit_count = 1',
> +       'check table sync worker commit count is updated');
> +
> +confirm_transaction_stats_update(
> +       $node_subscriber,
> +       'error_count = 1',
> +       'check table sync worker error count is updated');
I've removed the corresponding tests for table sync workers in the patch. 

But, I adopted the comment suggestion partly for the tests of the apply worker.
On the other hand, I didn't fix the 3rd arguments of confirm_transaction_stats_update().
It needs to be a noun, because it's connected to another string
"Timed out while waiting for ". in the function.
See the definition of the function.

The new patch v17 is shared in [1].

[1] -
https://www.postgresql.org/message-id/TYCPR01MB83734A7A0596AC7ADB0DCB51ED779%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
    Takamichi Osumi


RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Thursday, December 16, 2021 8:37 PM I wrote:
> Hi, created a new version v17 according to the recent discussion with changes
> to address other review comments.
FYI, in v17 I've removed one part of commit message
about spool file statistics on the subscriber.
My intention is just to make the patches more committable shape.

Although I deleted it, I'd say still there would be some room
for the discussion of the necessity. It's because to begin with,
we are interested in the disk writes (for the logical replication,
pg_stat_replication_slots is an example), and secondly there can
be a scenario that if the user of logical replication dislikes and wants
to suppress unnecessary writes of file on the subscriber
(STREAM ABORT causes truncate of file with changes, IIUC)
they can increase the logical_decoding_work_mem on the publisher.
I'll postpone this discussion, till it becomes necessary
or will abandon this idea, if it's rejected. Anyway,
I detached the discussion by removing it from the commit message.

Best Regards,
    Takamichi Osumi


Re: Failed transaction statistics to measure the logical replication progress

От
Kyotaro Horiguchi
Дата:
At Thu, 16 Dec 2021 11:36:46 +0000, "osumi.takamichi@fujitsu.com" <osumi.takamichi@fujitsu.com> wrote in 
> On Thursday, December 16, 2021 4:00 PM I wrote:
> > Thank you, everyone for confirming the direction.
> > I'll follow the consensus of the community and fix the patch, including other
> > comments.
> > I'll treat only the stats for apply workers.
> Hi, created a new version v17 according to the recent discussion
> with changes to address other review comments.
> 
> Kindly have a look at it.

It sends stats packets at every commit-like operation on apply
workers.  The current pgstat is so smart that it refrain from sending
stats packets at too high frequency.  We already suffer frequent stats
packets so apply workers need to bahave the same way.

That is, the new stats numbers are once accumulated locally then the
accumulated numbers are sent to stats collector by pgstat_report_stat.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
Friday, December 17, 2021 2:03 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> It sends stats packets at every commit-like operation on apply workers.  The
> current pgstat is so smart that it refrain from sending stats packets at too high
> frequency.  We already suffer frequent stats packets so apply workers need to
> bahave the same way.
>
> That is, the new stats numbers are once accumulated locally then the
> accumulated numbers are sent to stats collector by pgstat_report_stat.
Hi, Horiguchi-san.

I felt your point is absolutely right !
Updated the patch to address your concern.

Best Regards,
    Takamichi Osumi


Вложения

RE: Failed transaction statistics to measure the logical replication progress

От
"wangw.fnst@fujitsu.com"
Дата:
On Wednesday, December 22, 2021 10:30 PM
osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote:
> On Wednesday, December 22, 2021 8:38 PM I wrote:
> > Do we expect these commit counts which come from empty transactions ?
> This is another issue discussed in [1]
> where the patch in the thread is a work in progress, I think.
> ......
> IMHO, the conclusion is we are currently in the middle of fixing the behavior.

Thank you for telling me this.
After applying v19-* and v15-0001-Skip-empty-transactions-for-logical-replication.patch, 
I retested v19-* patches. The result of previous case looks good to me.

But the results of following cases are also similar to previous unexpected result
which increases commit_count or abort_count unexpectedly.
[1]
(Based on environment in the previous example, set TWO_PHASE=true)
[Publisher]
begin;
insert into replica_test1 values(1,'1');
prepare transaction 'id';
commit prepared 'id';

In subscriber side, the commit_count of two records(sub1 and sub2) is increased.

[2]
(Based on environment in the previous example, set STREAMING=on)
[Publisher]
begin;
INSERT INTO replica_test1 SELECT i, md5(i::text) FROM generate_series(1, 5000) s(i);
commit;

In subscriber side, the commit_count of two records(sub1 and sub2) is increased.

[3]
(Based on environment in the previous example, set TWO_PHASE=true)
[Publisher]
begin;
insert into replica_test1 values(1,'1');
prepare transaction 'id';
rollback prepared 'id';

In subscriber side, the abort_count of two records(sub1 and sub2) is increased.

I think the problem maybe is the patch you mentioned
(Skip-empty-transactions-for-logical-replication.patch) is not finished yet.
Share this information here.


Regards,
Wang wei

RE: Failed transaction statistics to measure the logical replication progress

От
"tanghy.fnst@fujitsu.com"
Дата:
On Wednesday, December 22, 2021 6:14 PM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote:
> 
> Attached the new patch v19.
> 

Thanks for your patch. I think it's better if you could add this patch to the commitfest.
Here are some comments:

1)
+       <structfield>commit_count</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of transactions successfully applied in this subscription.
+       Both COMMIT and COMMIT PREPARED increment this counter.
+      </para></entry>
+     </row>
...

I think the commands (like COMMIT, COMMIT PREPARED ...) can be surrounded with
"<command> </command>", thoughts?

2)
+extern void pgstat_report_subworker_xact_end(LogicalRepWorker *repWorker,
+                                             LogicalRepMsgType command,
+                                             bool bforce);

Should "bforce" be "force"?

3)
+ *  This should be called before the call of process_syning_tables() so to not

"process_syning_tables()" should be "process_syncing_tables()".

4)
+void
+pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker, bool force)
+{
+    static TimestampTz last_report = 0;
+    PgStat_MsgSubWorkerXactEnd msg;
+
+    if (!force)
+    {
...
+        if (!TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
+            return;
+        last_report = now;
+    }
+
...
+    if (repWorker->commit_count == 0 && repWorker->abort_count == 0)
+        return;
...

I think it's better to check commit_count and abort_count first, then check if
reach PGSTAT_STAT_INTERVAL.
Otherwise if commit_count and abort_count are 0, it is possible that the value
of last_report has been updated but it didn't send stats in fact. In this case,
last_report is not the real time that send last message.

Regards,
Tang

RE: Failed transaction statistics to measure the logical replication progress

От
"houzj.fnst@fujitsu.com"
Дата:
On Wednesday, December 22, 2021 6:14 PM Osumi, Takamichi <osumi.takamichi@fujitsu.com> wrote:
> Attached the new patch v19.
Hi,

Thanks for updating the patch.

--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -15,6 +15,7 @@
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"    /* for MAX_XFN_CHARS */
 #include "replication/logicalproto.h"
+#include "replication/worker_internal.h"

I noticed that the patch includes "worker_internal.h " in pgstat.h.
I think it might be better to only include this file in pgstat.c.

And it seems we can access MyLogicalRepWorker directly in the
following functions instead of passing a parameter.

+extern void pgstat_report_subworker_xact_end(LogicalRepWorker *repWorker,
+                                             LogicalRepMsgType command,
+                                             bool bforce);
+extern void pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker,
+                                             bool force);

Best regards,
Hou zj

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Monday, January 3, 2022 2:46 PM Hou, Zhijie/侯 志杰 <houzj.fnst@fujitsu.com> wrote:
> On Wednesday, December 22, 2021 6:14 PM Osumi, Takamichi
> <osumi.takamichi@fujitsu.com> wrote:
> > Attached the new patch v19.
> Hi,
> 
> Thanks for updating the patch.
> 
> --- a/src/include/pgstat.h
> +++ b/src/include/pgstat.h
> @@ -15,6 +15,7 @@
>  #include "portability/instr_time.h"
>  #include "postmaster/pgarch.h"    /* for MAX_XFN_CHARS */
>  #include "replication/logicalproto.h"
> +#include "replication/worker_internal.h"
> 
> I noticed that the patch includes "worker_internal.h " in pgstat.h.
> I think it might be better to only include this file in pgstat.c.
> And it seems we can access MyLogicalRepWorker directly in the following
> functions instead of passing a parameter.
> 
> +extern void pgstat_report_subworker_xact_end(LogicalRepWorker
> *repWorker,
> +
>          LogicalRepMsgType command,
> +
>          bool bforce);
> +extern void pgstat_send_subworker_xact_stats(LogicalRepWorker
> *repWorker,
> +
>          bool force);
Hi, thank you for your review !

Both are fixed. Additionally, I modified
related comments, the header comment of pgstat_send_subworker_xact_stats,
by the change.

One other new improvements in v20 is to have removed the 2nd argument of
pgstat_send_subworker_xact_stats(). When we called it from
commit-like operations, I passed 'false' without exceptions in v19
and noticed that could be removed.

Also, there's a really minor alignment in the source codes.
In pgstat.c, I placed pgstat_report_subworker_xact_end() after pgstat_report_subworker_error(),
and pgstat_send_subworker_xact_stats() after pgstat_send_subscription_purge() and so on
like the order of PgstatCollectorMain() and PgStat_Msg definition, because
my patch's new functions are added after the existing functions
for error handling functions of subscription workers.

Lastly, I changed the error report in pgstat_report_subworker_xact_end()
so that it can be more understandable easily and a bit modern.

Kindly have a look at the attached version.

Best Regards,
    Takamichi Osumi


Вложения

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Friday, December 31, 2021 10:12 AM Tang, Haiying/唐 海英 <tanghy.fnst@fujitsu.com> wrote:
> On Wednesday, December 22, 2021 6:14 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > Attached the new patch v19.
> >
> 
> Thanks for your patch. I think it's better if you could add this patch to the
> commitfest.
> Here are some comments:
Thank you for your review !
I've created one entry in the next commitfest for this patch [1]

> 
> 1)
> +       <structfield>commit_count</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Number of transactions successfully applied in this subscription.
> +       Both COMMIT and COMMIT PREPARED increment this counter.
> +      </para></entry>
> +     </row>
> ...
> 
> I think the commands (like COMMIT, COMMIT PREPARED ...) can be
> surrounded with "<command> </command>", thoughts?
Makes sense to me. Fixed.

Note that to the user perspective,
we should write only COMMIT and COMMIT PREPARED in the documentation.
Thus, I don't list up other commands.

I wrapped ROLLBACK PREPARED for abort_count as well.
 
> 2)
> +extern void pgstat_report_subworker_xact_end(LogicalRepWorker
> *repWorker,
> +
>          LogicalRepMsgType command,
> +
>          bool bforce);
> 
> Should "bforce" be "force"?
Fixed the typo.

> 3)
> + *  This should be called before the call of process_syning_tables() so
> + to not
> 
> "process_syning_tables()" should be "process_syncing_tables()".
Fixed.
 
> 4)
> +void
> +pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker, bool
> +force) {
> +    static TimestampTz last_report = 0;
> +    PgStat_MsgSubWorkerXactEnd msg;
> +
> +    if (!force)
> +    {
> ...
> +        if (!TimestampDifferenceExceeds(last_report, now,
> PGSTAT_STAT_INTERVAL))
> +            return;
> +        last_report = now;
> +    }
> +
> ...
> +    if (repWorker->commit_count == 0 && repWorker->abort_count ==
> 0)
> +        return;
> ...
> 
> I think it's better to check commit_count and abort_count first, then check if
> reach PGSTAT_STAT_INTERVAL.
> Otherwise if commit_count and abort_count are 0, it is possible that the value
> of last_report has been updated but it didn't send stats in fact. In this case,
> last_report is not the real time that send last message.
Yeah, agreed. This fix is right in terms of the variable name aspect.

The only scenario that we can take advantage of the previous implementation of
v19's pgstat_send_subworker_xact_stats() should be a case where
we execute a bunch of commit-like logical replication apply messages
within PGSTAT_STAT_INTERVAL intensively and continuously for long period,
because we check "repWorker->commit_count == 0 && repWorker->abort_count == 0"
just once before calling pgstat_send() in this case.
*But*, this scenario didn't look reasonable. In other words,
the way to call TimestampDifferenceExceeds() only if there's any need of
update for commit_count/abort_count looks less heavy.
Accordingly, I've fixed it as you suggested.
Also, I modified some comments in pgstat_send_subworker_xact_stats() for this change.

Kindly have a look at the v20 shared in [2].

[1] - https://commitfest.postgresql.org/37/3504/
[2] -
https://www.postgresql.org/message-id/TYCPR01MB8373AB2AE1A6EC7B9E012519ED4A9%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
    Takamichi Osumi


RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Thursday, December 23, 2021 6:37 PM Wang, Wei/王 威 <wangw.fnst@fujitsu.com> wrote:
> On Wednesday, December 22, 2021 10:30 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> > On Wednesday, December 22, 2021 8:38 PM I wrote:
> > > Do we expect these commit counts which come from empty transactions ?
> > This is another issue discussed in [1] where the patch in the thread
> > is a work in progress, I think.
> > ......
> > IMHO, the conclusion is we are currently in the middle of fixing the behavior.
> 
> Thank you for telling me this.
> After applying v19-* and
> v15-0001-Skip-empty-transactions-for-logical-replication.patch,
> I retested v19-* patches. The result of previous case looks good to me.
> 
> But the results of following cases are also similar to previous unexpected result
> which increases commit_count or abort_count unexpectedly.
> [1]
> (Based on environment in the previous example, set TWO_PHASE=true)
> [Publisher] begin; insert into replica_test1 values(1,'1'); prepare transaction
> 'id'; commit prepared 'id';
> 
> In subscriber side, the commit_count of two records(sub1 and sub2) is
> increased.
> 
> [2]
> (Based on environment in the previous example, set STREAMING=on)
> [Publisher] begin; INSERT INTO replica_test1 SELECT i, md5(i::text) FROM
> generate_series(1, 5000) s(i); commit;
> 
> In subscriber side, the commit_count of two records(sub1 and sub2) is
> increased.
> 
> [3]
> (Based on environment in the previous example, set TWO_PHASE=true)
> [Publisher] begin; insert into replica_test1 values(1,'1'); prepare transaction
> 'id'; rollback prepared 'id';
> 
> In subscriber side, the abort_count of two records(sub1 and sub2) is
> increased.
> 
> I think the problem maybe is the patch you mentioned
> (Skip-empty-transactions-for-logical-replication.patch) is not finished yet.
> Share this information here.
Hi, thank you for your report.

Yes. As the patch's commit message mentions, the patch doesn't
cover streaming and two phase transactions.

In the above reply, I said that this was an independent issue
and we were in the middle of the modification of the behavior,
but empty transaction turned out to be harmful enough for this feature.
As far as I searched, the current logical replication sends
every transaction that is unrelated to publication specification.
It means that in a common scenario where some parts of tables are
replicated but others are not, the subscription statistics will
be buried by the updates by the empty transactions for the latter,
which damages this patch's value greatly.

Therefore, I included the description in the documentation
reported by you.

The attached v21 has a couple of other minor updates
like a modification of error message text.


Best Regards,
    Takamichi Osumi


Вложения

Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Wed, Jan 12, 2022 at 6:04 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Thursday, December 23, 2021 6:37 PM Wang, Wei/王 威 <wangw.fnst@fujitsu.com> wrote:
> > On Wednesday, December 22, 2021 10:30 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > > On Wednesday, December 22, 2021 8:38 PM I wrote:
> > > > Do we expect these commit counts which come from empty transactions ?
> > > This is another issue discussed in [1] where the patch in the thread
> > > is a work in progress, I think.
> > > ......
> > > IMHO, the conclusion is we are currently in the middle of fixing the behavior.
> >
> > Thank you for telling me this.
> > After applying v19-* and
> > v15-0001-Skip-empty-transactions-for-logical-replication.patch,
> > I retested v19-* patches. The result of previous case looks good to me.
> >
> > But the results of following cases are also similar to previous unexpected result
> > which increases commit_count or abort_count unexpectedly.
> > [1]
> > (Based on environment in the previous example, set TWO_PHASE=true)
> > [Publisher] begin; insert into replica_test1 values(1,'1'); prepare transaction
> > 'id'; commit prepared 'id';
> >
> > In subscriber side, the commit_count of two records(sub1 and sub2) is
> > increased.
> >
> > [2]
> > (Based on environment in the previous example, set STREAMING=on)
> > [Publisher] begin; INSERT INTO replica_test1 SELECT i, md5(i::text) FROM
> > generate_series(1, 5000) s(i); commit;
> >
> > In subscriber side, the commit_count of two records(sub1 and sub2) is
> > increased.
> >
> > [3]
> > (Based on environment in the previous example, set TWO_PHASE=true)
> > [Publisher] begin; insert into replica_test1 values(1,'1'); prepare transaction
> > 'id'; rollback prepared 'id';
> >
> > In subscriber side, the abort_count of two records(sub1 and sub2) is
> > increased.
> >
> > I think the problem maybe is the patch you mentioned
> > (Skip-empty-transactions-for-logical-replication.patch) is not finished yet.
> > Share this information here.
> Hi, thank you for your report.
>
> Yes. As the patch's commit message mentions, the patch doesn't
> cover streaming and two phase transactions.
>
> In the above reply, I said that this was an independent issue
> and we were in the middle of the modification of the behavior,
> but empty transaction turned out to be harmful enough for this feature.
>

Isn't that because of this patch? I mean the patch is reporting count
even when during apply we haven't started any transaction. In
particular, if you would have reported stats from
apply_handle_commit_internal() when IsTransactionState() returns true,
then it shouldn't have updated the stats for an empty transaction.
Similarly, I see for rollback_prepared, you don't need to report stats
if there is no prepared transaction to rollback. I think for
commit_prepare case, we can't do much for empty xacts but why would
that be a problem of this patch? I think as far as this patch goes, it
reports the number of completed xacts (committed/aborted) in the
subscription worker, so, it doesn't need to handle any special cases
like empty transactions.

--
With Regards,
Amit Kapila.



Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Tue, Jan 4, 2022 at 5:22 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Friday, December 31, 2021 10:12 AM Tang, Haiying/唐 海英 <tanghy.fnst@fujitsu.com> wrote:
> > 4)
> > +void
> > +pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker, bool
> > +force) {
> > +     static TimestampTz last_report = 0;
> > +     PgStat_MsgSubWorkerXactEnd msg;
> > +
> > +     if (!force)
> > +     {
> > ...
> > +             if (!TimestampDifferenceExceeds(last_report, now,
> > PGSTAT_STAT_INTERVAL))
> > +                     return;
> > +             last_report = now;
> > +     }
> > +
> > ...
> > +     if (repWorker->commit_count == 0 && repWorker->abort_count ==
> > 0)
> > +             return;
> > ...
> >
> > I think it's better to check commit_count and abort_count first, then check if
> > reach PGSTAT_STAT_INTERVAL.
> > Otherwise if commit_count and abort_count are 0, it is possible that the value
> > of last_report has been updated but it didn't send stats in fact. In this case,
> > last_report is not the real time that send last message.
> Yeah, agreed. This fix is right in terms of the variable name aspect.
>

Can't we use pgstat_report_stat() here? Basically, you can update xact
completetion counters during apply, and then from
pgstat_report_stat(), you can invoke a logical replication worker
stats-related function.

--
With Regards,
Amit Kapila.



Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Thu, Feb 17, 2022 at 3:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 4, 2022 at 5:22 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Friday, December 31, 2021 10:12 AM Tang, Haiying/唐 海英 <tanghy.fnst@fujitsu.com> wrote:
> > > 4)
> > > +void
> > > +pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker, bool
> > > +force) {
> > > +     static TimestampTz last_report = 0;
> > > +     PgStat_MsgSubWorkerXactEnd msg;
> > > +
> > > +     if (!force)
> > > +     {
> > > ...
> > > +             if (!TimestampDifferenceExceeds(last_report, now,
> > > PGSTAT_STAT_INTERVAL))
> > > +                     return;
> > > +             last_report = now;
> > > +     }
> > > +
> > > ...
> > > +     if (repWorker->commit_count == 0 && repWorker->abort_count ==
> > > 0)
> > > +             return;
> > > ...
> > >
> > > I think it's better to check commit_count and abort_count first, then check if
> > > reach PGSTAT_STAT_INTERVAL.
> > > Otherwise if commit_count and abort_count are 0, it is possible that the value
> > > of last_report has been updated but it didn't send stats in fact. In this case,
> > > last_report is not the real time that send last message.
> > Yeah, agreed. This fix is right in terms of the variable name aspect.
> >
>
> Can't we use pgstat_report_stat() here? Basically, you can update xact
> completetion counters during apply, and then from
> pgstat_report_stat(), you can invoke a logical replication worker
> stats-related function.
>

If we can do this then we can save the logic this patch is trying to
introduce for PGSTAT_STAT_INTERVAL.

--
With Regards,
Amit Kapila.



RE: Failed transaction statistics to measure the logical replication progress

От
"tanghy.fnst@fujitsu.com"
Дата:
On Wed, Jan 12, 2022 8:35 PM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote:
> 
> The attached v21 has a couple of other minor updates
> like a modification of error message text.
> 
> 

Thanks for updating the patch. Here are some comments.

1) I saw the following description about pg_stat_subscription_workers view in
existing doc:

   The <structname>pg_stat_subscription_workers</structname> view will contain
   one row per subscription worker on which errors have occurred, ...

It only says  "which errors have occurred", maybe we should also mention
transactions here, right?

2)
/* ----------
+ * pgstat_send_subworker_xact_stats() -
+ *
+ *    Send a subworker's transaction stats to the collector.
+ *    The statistics are cleared upon return.

Should "The statistics are cleared upon return" changed to "The statistics are
cleared upon sending"? Because if it doesn't reach PGSTAT_STAT_INTERVAL and the
transaction stats are not sent, the function will return without clearing out
statistics.

3)
+    Assert(command == LOGICAL_REP_MSG_COMMIT ||
+           command == LOGICAL_REP_MSG_STREAM_COMMIT ||
+           command == LOGICAL_REP_MSG_COMMIT_PREPARED ||
+           command == LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+    switch (command)
+    {
+        case LOGICAL_REP_MSG_COMMIT:
+        case LOGICAL_REP_MSG_STREAM_COMMIT:
+        case LOGICAL_REP_MSG_COMMIT_PREPARED:
+            MyLogicalRepWorker->commit_count++;
+            break;
+        case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+            MyLogicalRepWorker->abort_count++;
+            break;
+        default:
+            ereport(ERROR,
+                    errmsg("invalid logical message type for transaction statistics of subscription"));
+            break;
+    }

I'm not sure that do we need the assert, because it will report an error later
if command is an invalid value.

4) I noticed that the abort_count doesn't include aborted streaming transactions.
Should we take this case into consideration?

Regards,
Tang


RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Thursday, February 17, 2022 6:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Feb 17, 2022 at 3:13 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Tue, Jan 4, 2022 at 5:22 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > On Friday, December 31, 2021 10:12 AM Tang, Haiying/唐 海英
> <tanghy.fnst@fujitsu.com> wrote:
> > > > 4)
> > > > +void
> > > > +pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker,
> > > > +bool
> > > > +force) {
> > > > +     static TimestampTz last_report = 0;
> > > > +     PgStat_MsgSubWorkerXactEnd msg;
> > > > +
> > > > +     if (!force)
> > > > +     {
> > > > ...
> > > > +             if (!TimestampDifferenceExceeds(last_report, now,
> > > > PGSTAT_STAT_INTERVAL))
> > > > +                     return;
> > > > +             last_report = now;
> > > > +     }
> > > > +
> > > > ...
> > > > +     if (repWorker->commit_count == 0 && repWorker->abort_count
> > > > + ==
> > > > 0)
> > > > +             return;
> > > > ...
> > > >
> > > > I think it's better to check commit_count and abort_count first,
> > > > then check if reach PGSTAT_STAT_INTERVAL.
> > > > Otherwise if commit_count and abort_count are 0, it is possible
> > > > that the value of last_report has been updated but it didn't send
> > > > stats in fact. In this case, last_report is not the real time that send last
> message.
> > > Yeah, agreed. This fix is right in terms of the variable name aspect.
> > >
> >
> > Can't we use pgstat_report_stat() here? Basically, you can update xact
> > completetion counters during apply, and then from
> > pgstat_report_stat(), you can invoke a logical replication worker
> > stats-related function.
> >
> 
> If we can do this then we can save the logic this patch is trying to introduce for
> PGSTAT_STAT_INTERVAL.
Hi, I've encounter a couple of questions during my modification, following your advice.

In the pgstat_report_stat, we refer to the return value of
GetCurrentTransactionStopTimestamp to compare the time different from the last time.
(In my previous patch, I used GetCurrentTimestamp)

This time is updated in apply_handle_commit_internal's CommitTransactionCommand for the apply worker.
Then, if I update the subscription worker stats(commit_count/abort_count) immediately after
this CommitTransactionCommand and immediately call pgstat_report_stat in the apply_handle_commit_internal,
the time difference becomes too small (falls short of PGSTAT_STAT_INTERVAL).
Also, the time of GetCurrentTransactionStopTimestamp is not updated
even when I keep calling pgstat_report_stat repeatedly.
Then, IIUC the next possible timing that message of commit_count or abort_count
is sent to the stats collector would become the time when we execute another transaction
by the apply worker and update the time for GetCurrentTransactionStopTimestamp
and rerun pgstat_report_stat again.

So, if we keep GetCurrentTransactionStopTimestamp without change,
an update of stats depends on another new subsequent transaction if we simply merge those.
(this leads to users cannot see the latest stats information ?)
At least, I got a test failure because of this function for streaming commit case
because it uses poll_query_until to wait for stats update.

On the other hand, replacing GetCurrentTransactionStopTimestamp with
GetCurrentTimestamp in case of apply worker looks have another negative impact.
If we do so, it becomes possible that we go into the code to scan TabStatusArray with
PgStat_TableStatus's trans with non-null values, because of the timing change.

I might be able to avoid this kind of assert failure if I write the message send
function of this patch before other existing functions to send various type of messages
and return if the process is apply worker in pgstat_report_stat.
But, I can't be convinced that this way of modification is OK.

What did you think about those issues ?

Best Regards,
    Takamichi Osumi


Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Fri, Feb 18, 2022 at 2:04 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Thursday, February 17, 2022 6:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Thu, Feb 17, 2022 at 3:13 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > > Can't we use pgstat_report_stat() here? Basically, you can update xact
> > > completetion counters during apply, and then from
> > > pgstat_report_stat(), you can invoke a logical replication worker
> > > stats-related function.
> > >
> >
> > If we can do this then we can save the logic this patch is trying to introduce for
> > PGSTAT_STAT_INTERVAL.
> Hi, I've encounter a couple of questions during my modification, following your advice.
>
> In the pgstat_report_stat, we refer to the return value of
> GetCurrentTransactionStopTimestamp to compare the time different from the last time.
> (In my previous patch, I used GetCurrentTimestamp)
>
> This time is updated in apply_handle_commit_internal's CommitTransactionCommand for the apply worker.
> Then, if I update the subscription worker stats(commit_count/abort_count) immediately after
> this CommitTransactionCommand and immediately call pgstat_report_stat in the apply_handle_commit_internal,
> the time difference becomes too small (falls short of PGSTAT_STAT_INTERVAL).
> Also, the time of GetCurrentTransactionStopTimestamp is not updated
> even when I keep calling pgstat_report_stat repeatedly.
> Then, IIUC the next possible timing that message of commit_count or abort_count
> is sent to the stats collector would become the time when we execute another transaction
> by the apply worker and update the time for GetCurrentTransactionStopTimestamp
> and rerun pgstat_report_stat again.
>

I think but same is true in the case of the transaction in the backend
where we increment commit counter via AtEOXact_PgStat after updating
the transaction time. After that, we call pgstat_report_stat() at
later point. How is this case different?

> So, if we keep GetCurrentTransactionStopTimestamp without change,
> an update of stats depends on another new subsequent transaction if we simply merge those.
> (this leads to users cannot see the latest stats information ?)
>

I think this should be okay as these don't need to be accurate.

> At least, I got a test failure because of this function for streaming commit case
> because it uses poll_query_until to wait for stats update.
>

I feel it is not a good idea to wait for the accurate update of these counters.

-- 
With Regards,
Amit Kapila.



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Friday, February 18, 2022 8:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Feb 18, 2022 at 2:04 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Thursday, February 17, 2022 6:45 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > > On Thu, Feb 17, 2022 at 3:13 PM Amit Kapila
> > > <amit.kapila16@gmail.com>
> > > wrote:
> > > > Can't we use pgstat_report_stat() here? Basically, you can update
> > > > xact completetion counters during apply, and then from
> > > > pgstat_report_stat(), you can invoke a logical replication worker
> > > > stats-related function.
> > > >
> > >
> > > If we can do this then we can save the logic this patch is trying to
> > > introduce for PGSTAT_STAT_INTERVAL.
> > Hi, I've encounter a couple of questions during my modification, following
> your advice.
> >
> > In the pgstat_report_stat, we refer to the return value of
> > GetCurrentTransactionStopTimestamp to compare the time different from
> the last time.
> > (In my previous patch, I used GetCurrentTimestamp)
> >
> > This time is updated in apply_handle_commit_internal's
> CommitTransactionCommand for the apply worker.
> > Then, if I update the subscription worker
> > stats(commit_count/abort_count) immediately after this
> > CommitTransactionCommand and immediately call pgstat_report_stat in the
> apply_handle_commit_internal, the time difference becomes too small (falls
> short of PGSTAT_STAT_INTERVAL).
> > Also, the time of GetCurrentTransactionStopTimestamp is not updated
> > even when I keep calling pgstat_report_stat repeatedly.
> > Then, IIUC the next possible timing that message of commit_count or
> > abort_count is sent to the stats collector would become the time when
> > we execute another transaction by the apply worker and update the time
> > for GetCurrentTransactionStopTimestamp
> > and rerun pgstat_report_stat again.
> >
> 
> I think but same is true in the case of the transaction in the backend where we
> increment commit counter via AtEOXact_PgStat after updating the transaction
> time. After that, we call pgstat_report_stat() at later point. How is this case
> different?
> 
> > So, if we keep GetCurrentTransactionStopTimestamp without change, an
> > update of stats depends on another new subsequent transaction if we
> simply merge those.
> > (this leads to users cannot see the latest stats information ?)
> >
> 
> I think this should be okay as these don't need to be accurate.
> 
> > At least, I got a test failure because of this function for streaming
> > commit case because it uses poll_query_until to wait for stats update.
> >
> 
> I feel it is not a good idea to wait for the accurate update of these counters.
Ah, then I had wrote tests based on totally wrong direction and made noises for it.
Sorry for that. I don't see tests for existing xact_commit/rollback count,
so I'll follow the same way.

Attached a new patch that addresses three major improvements I've got so far as comments.
1. skip increment of stats counter by empty transaction, on the subscriber side
   (except for commit prepared)
2. utilize the existing pgstat_report_stat, instead of having a similar logic newly.
3. remove the wrong tests.


Best Regards,
    Takamichi Osumi


Вложения

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Friday, February 18, 2022 3:34 PM Tang, Haiying/唐 海英 <tanghy.fnst@fujitsu.com> wrote:
> On Wed, Jan 12, 2022 8:35 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > The attached v21 has a couple of other minor updates like a
> > modification of error message text.
> >
> >
> 
> Thanks for updating the patch. Here are some comments.
Thank you for your reivew !



> 1) I saw the following description about pg_stat_subscription_workers view in
> existing doc:
> 
>    The <structname>pg_stat_subscription_workers</structname> view will
> contain
>    one row per subscription worker on which errors have occurred, ...
> 
> It only says  "which errors have occurred", maybe we should also mention
> transactions here, right?
I wrote about this statistics in the next line but as you pointed out,
separating the description into two sentences wasn't good idea.
Fixed.



> 2)
> /* ----------
> + * pgstat_send_subworker_xact_stats() -
> + *
> + *    Send a subworker's transaction stats to the collector.
> + *    The statistics are cleared upon return.
> 
> Should "The statistics are cleared upon return" changed to "The statistics are
> cleared upon sending"? Because if it doesn't reach PGSTAT_STAT_INTERVAL
> and the transaction stats are not sent, the function will return without clearing
> out statistics.
Now, the purpose of this function has become purely
to send a message and whenever it's called, the function
clears the saved stats. So, I skipped this comments now.


> 3)
> +    Assert(command == LOGICAL_REP_MSG_COMMIT ||
> +           command == LOGICAL_REP_MSG_STREAM_COMMIT ||
> +           command == LOGICAL_REP_MSG_COMMIT_PREPARED
> ||
> +           command ==
> LOGICAL_REP_MSG_ROLLBACK_PREPARED);
> +
> +    switch (command)
> +    {
> +        case LOGICAL_REP_MSG_COMMIT:
> +        case LOGICAL_REP_MSG_STREAM_COMMIT:
> +        case LOGICAL_REP_MSG_COMMIT_PREPARED:
> +            MyLogicalRepWorker->commit_count++;
> +            break;
> +        case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
> +            MyLogicalRepWorker->abort_count++;
> +            break;
> +        default:
> +            ereport(ERROR,
> +                    errmsg("invalid logical message type
> for transaction statistics of subscription"));
> +            break;
> +    }
> 
> I'm not sure that do we need the assert, because it will report an error later if
> command is an invalid value.
The Assert has been removed, along with the switch branches now.
Since there was an adivce that we should call this from apply_handle_commit_internal
and from that function, if we don't want to change this function's argument,
all we need to do is to pass a boolean value that indicates the stats is
commit_count or abort_count. Kindly have a look at the updated version.


> 4) I noticed that the abort_count doesn't include aborted streaming
> transactions.
> Should we take this case into consideration?
Hmm, we can add this into this column, when there's no objection.
I'm not sure but someone might say those should be separate columns.

The new patch v22 is shared in [2].

[2] -
https://www.postgresql.org/message-id/TYCPR01MB83737C689F8F310C87C19C1EED379%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
    Takamichi Osumi


RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Saturday, February 19, 2022 12:00 AM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote:
> On Friday, February 18, 2022 3:34 PM Tang, Haiying/唐 海英
> <tanghy.fnst@fujitsu.com> wrote:
> > On Wed, Jan 12, 2022 8:35 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > 4) I noticed that the abort_count doesn't include aborted streaming
> > transactions.
> > Should we take this case into consideration?
> Hmm, we can add this into this column, when there's no objection.
> I'm not sure but someone might say those should be separate columns.
I've addressed this point in a new v23 patch,
since there was no opinion on this so far.

Kindly have a look at the attached one.

Best Regards,
    Takamichi Osumi


Вложения

RE: Failed transaction statistics to measure the logical replication progress

От
"wangw.fnst@fujitsu.com"
Дата:
On Mon, Feb 21, 2022 11:46 AM Osumi, Takamichi/大墨 昂道 <osumi.takamichi@fujitsu.com> wrote:
>I've addressed this point in a new v23 patch, since there was no opinion on this so far.
>Kindly have a look at the attached one.
Thanks for updating the patch. Here is a comment:

In function apply_handle_stream_abort:
@@ -1217,6 +1219,7 @@ apply_handle_stream_abort(StringInfo s)
     {
         set_apply_error_context_xact(xid, 0);
         stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+        pgstat_report_subworker_xact_end(false);
     }
     else
     {

I think there is a problem here, pgstat_report_stat is not invoked here.
While the other three places where function pgstat_report_subworker_xact_end is
invoked, the function pgstat_report_stat is invoked.
Do we need to invoke pgstat_report_stat in apply_handle_stream_abort?

Regards,
Wang wei

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Monday, February 21, 2022 6:06 PM Monday, February 21, 2022 6:06 PM wrote:
> On Mon, Feb 21, 2022 11:46 AM Osumi, Takamichi/大墨 昂道
> <osumi.takamichi@fujitsu.com> wrote:
> >I've addressed this point in a new v23 patch, since there was no opinion on
> this so far.
> >Kindly have a look at the attached one.
> Thanks for updating the patch. Here is a comment:
> 
> In function apply_handle_stream_abort:
> @@ -1217,6 +1219,7 @@ apply_handle_stream_abort(StringInfo s)
>      {
>          set_apply_error_context_xact(xid, 0);
>          stream_cleanup_files(MyLogicalRepWorker->subid, xid);
> +        pgstat_report_subworker_xact_end(false);
>      }
>      else
>      {
> 
> I think there is a problem here, pgstat_report_stat is not invoked here.
> While the other three places where function
> pgstat_report_subworker_xact_end is invoked, the function pgstat_report_stat
> is invoked.
> Do we need to invoke pgstat_report_stat in apply_handle_stream_abort?
Hi,

I had tested this case before I posted the latest patch v23.
It works when I call pg_stat_report_stat by other transaction.

But, if we want to add pgstat_report_stat here,
I need to investigate the impact of the addition.
I'll check it and let you know.


Best Regards,
    Takamichi Osumi


RE: Failed transaction statistics to measure the logical replication progress

От
"tanghy.fnst@fujitsu.com"
Дата:
On Mon, Feb 21, 2022 11:46 AM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote:
> 
> On Saturday, February 19, 2022 12:00 AM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> > On Friday, February 18, 2022 3:34 PM Tang, Haiying/唐 海英
> > <tanghy.fnst@fujitsu.com> wrote:
> > > On Wed, Jan 12, 2022 8:35 PM osumi.takamichi@fujitsu.com
> > > <osumi.takamichi@fujitsu.com> wrote:
> > > 4) I noticed that the abort_count doesn't include aborted streaming
> > > transactions.
> > > Should we take this case into consideration?
> > Hmm, we can add this into this column, when there's no objection.
> > I'm not sure but someone might say those should be separate columns.
> I've addressed this point in a new v23 patch,
> since there was no opinion on this so far.
> 
> Kindly have a look at the attached one.
> 

Thanks for updating the patch.

I found a problem when using it. When a replication workers exits, the
transaction stats should be sent to stats collector if they were not sent before
because it didn't reach PGSTAT_STAT_INTERVAL. But I saw that the stats weren't
updated as expected.

I looked into it and found that the replication worker would send the 
transaction stats (if any) before it exits. But it got invalid subid in
pgstat_send_subworker_xact_stats(), which led to the following result:

postgres=# select pg_stat_get_subscription_worker(0, null);
 pg_stat_get_subscription_worker
---------------------------------
 (0,,2,0,0,,,,0,"",)
(1 row)

I think that's because subid has already been cleaned when trying to send the
stats. I printed the value of before_shmem_exit_list, the functions in this list
would be called in shmem_exit() when the worker exits.
logicalrep_worker_onexit() would clean up the worker info (including subid), and
pgstat_shutdown_hook() would send stats if any.  logicalrep_worker_onexit() was
called before calling pgstat_shutdown_hook().

(gdb) p before_shmem_exit_list
$1 = {{function = 0xa88f1e <pgstat_shutdown_hook>, arg = 0}, {function = 0xb619e7 <BeforeShmemExit_Files>, arg = 0},
{function= 0xb07b5c <ReplicationSlotShmemExit>, arg = 0}, {
 
    function = 0xabdd93 <logicalrep_worker_onexit>, arg = 0}, {function = 0xe30c89 <ShutdownPostgres>, arg = 0},
{function= 0x0, arg = 0} <repeats 15 times>}
 

Maybe we should make some modification to fix it.

Regards,
Tang

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Tuesday, February 22, 2022 10:15 AM Tang, Haiying/唐 海英 <tanghy.fnst@fujitsu.com> wrote:
> On Mon, Feb 21, 2022 11:46 AM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Saturday, February 19, 2022 12:00 AM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > > On Friday, February 18, 2022 3:34 PM Tang, Haiying/唐 海英
> > > <tanghy.fnst@fujitsu.com> wrote:
> > > > On Wed, Jan 12, 2022 8:35 PM osumi.takamichi@fujitsu.com
> > > > <osumi.takamichi@fujitsu.com> wrote:
> > > > 4) I noticed that the abort_count doesn't include aborted
> > > > streaming transactions.
> > > > Should we take this case into consideration?
> > > Hmm, we can add this into this column, when there's no objection.
> > > I'm not sure but someone might say those should be separate columns.
> > I've addressed this point in a new v23 patch, since there was no
> > opinion on this so far.
> >
> > Kindly have a look at the attached one.
> >
> 
> Thanks for updating the patch.
> 
> I found a problem when using it. When a replication workers exits, the
> transaction stats should be sent to stats collector if they were not sent before
> because it didn't reach PGSTAT_STAT_INTERVAL. But I saw that the stats
> weren't updated as expected.
> 
> I looked into it and found that the replication worker would send the transaction
> stats (if any) before it exits. But it got invalid subid in
> pgstat_send_subworker_xact_stats(), which led to the following result:
> 
> postgres=# select pg_stat_get_subscription_worker(0, null);
> pg_stat_get_subscription_worker
> ---------------------------------
>  (0,,2,0,0,,,,0,"",)
> (1 row)
> 
> I think that's because subid has already been cleaned when trying to send the
> stats. I printed the value of before_shmem_exit_list, the functions in this list
> would be called in shmem_exit() when the worker exits.
> logicalrep_worker_onexit() would clean up the worker info (including subid),
> and
> pgstat_shutdown_hook() would send stats if any.  logicalrep_worker_onexit()
> was called before calling pgstat_shutdown_hook().
> 
> (gdb) p before_shmem_exit_list
> $1 = {{function = 0xa88f1e <pgstat_shutdown_hook>, arg = 0}, {function =
> 0xb619e7 <BeforeShmemExit_Files>, arg = 0}, {function = 0xb07b5c
> <ReplicationSlotShmemExit>, arg = 0}, {
>     function = 0xabdd93 <logicalrep_worker_onexit>, arg = 0}, {function =
> 0xe30c89 <ShutdownPostgres>, arg = 0}, {function = 0x0, arg = 0} <repeats 15
> times>}
> 
> Maybe we should make some modification to fix it.
Thank you for letting me know this issue.
I'll investigate this and will report the result.


Best Regards,
    Takamichi Osumi


Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Tue, Feb 22, 2022 at 6:45 AM tanghy.fnst@fujitsu.com
<tanghy.fnst@fujitsu.com> wrote:
>
> I found a problem when using it. When a replication workers exits, the
> transaction stats should be sent to stats collector if they were not sent before
> because it didn't reach PGSTAT_STAT_INTERVAL. But I saw that the stats weren't
> updated as expected.
>
> I looked into it and found that the replication worker would send the
> transaction stats (if any) before it exits. But it got invalid subid in
> pgstat_send_subworker_xact_stats(), which led to the following result:
>
> postgres=# select pg_stat_get_subscription_worker(0, null);
>  pg_stat_get_subscription_worker
> ---------------------------------
>  (0,,2,0,0,,,,0,"",)
> (1 row)
>
> I think that's because subid has already been cleaned when trying to send the
> stats. I printed the value of before_shmem_exit_list, the functions in this list
> would be called in shmem_exit() when the worker exits.
> logicalrep_worker_onexit() would clean up the worker info (including subid), and
> pgstat_shutdown_hook() would send stats if any.  logicalrep_worker_onexit() was
> called before calling pgstat_shutdown_hook().
>

Yeah, I think that is a problem and maybe we can think of solving it
by sending the stats via logicalrep_worker_onexit before subid is
cleared but not sure that is a good idea. I feel we need to go back to
the idea of v21 for sending stats instead of using pgstat_report_stat.
I think the one thing which we could improve is to avoid trying to
send it each time before receiving each message by walrcv_receive and
rather try to send it before we try to wait (WaitLatchOrSocket).
Trying after each message doesn't seem to be required and could lead
to some overhead as well. What do you think?

-- 
With Regards,
Amit Kapila.



Re: Failed transaction statistics to measure the logical replication progress

От
Masahiko Sawada
Дата:
On Mon, Feb 21, 2022 at 12:45 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Saturday, February 19, 2022 12:00 AM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote:
> > On Friday, February 18, 2022 3:34 PM Tang, Haiying/唐 海英
> > <tanghy.fnst@fujitsu.com> wrote:
> > > On Wed, Jan 12, 2022 8:35 PM osumi.takamichi@fujitsu.com
> > > <osumi.takamichi@fujitsu.com> wrote:
> > > 4) I noticed that the abort_count doesn't include aborted streaming
> > > transactions.
> > > Should we take this case into consideration?
> > Hmm, we can add this into this column, when there's no objection.
> > I'm not sure but someone might say those should be separate columns.
> I've addressed this point in a new v23 patch,
> since there was no opinion on this so far.
>
> Kindly have a look at the attached one.
>

I have some comments on v23 patch:

@@ -66,6 +66,12 @@ typedef struct LogicalRepWorker
    TimestampTz last_recv_time;
    XLogRecPtr  reply_lsn;
    TimestampTz reply_time;
+
+   /*
+    * Transaction statistics of subscription worker
+    */
+   int64       commit_count;
+   int64       abort_count;
 } LogicalRepWorker;

I think that adding these statistics to the struct whose data is
allocated on the shared memory is not a good idea since they don't
need to be shared. We might want to add more statistics for
subscriptions such as insert_count and update_count in the future. I
think it's better to track these statistics in local memory either in
worker.c or pgstat.c.

+/* ----------
+ * pgstat_report_subworker_xact_end() -
+ *
+ *  Update the statistics of subscription worker and have
+ *  pgstat_report_stat send a message to stats collector
+ *  after count increment.
+ * ----------
+ */
+void
+pgstat_report_subworker_xact_end(bool is_commit)
+{
+    if (is_commit)
+        MyLogicalRepWorker->commit_count++;
+    else
+        MyLogicalRepWorker->abort_count++;
+}

It's slightly odd and it seems unnecessary to me that we modify fields
of MyLogicalRepWorker in pgstat.c. Although this function has “report”
in its name but it just increments the counter. I think we can do that
in worker.c.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Thursday, February 24, 2022 11:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I have some comments on v23 patch:
> 
> @@ -66,6 +66,12 @@ typedef struct LogicalRepWorker
>     TimestampTz last_recv_time;
>     XLogRecPtr  reply_lsn;
>     TimestampTz reply_time;
> +
> +   /*
> +    * Transaction statistics of subscription worker
> +    */
> +   int64       commit_count;
> +   int64       abort_count;
>  } LogicalRepWorker;
> 
> I think that adding these statistics to the struct whose data is allocated on the
> shared memory is not a good idea since they don't need to be shared. We might
> want to add more statistics for subscriptions such as insert_count and
> update_count in the future. I think it's better to track these statistics in local
> memory either in worker.c or pgstat.c.
Fixed.

> +/* ----------
> + * pgstat_report_subworker_xact_end() -
> + *
> + *  Update the statistics of subscription worker and have
> + *  pgstat_report_stat send a message to stats collector
> + *  after count increment.
> + * ----------
> + */
> +void
> +pgstat_report_subworker_xact_end(bool is_commit) {
> +    if (is_commit)
> +        MyLogicalRepWorker->commit_count++;
> +    else
> +        MyLogicalRepWorker->abort_count++;
> +}
> 
> It's slightly odd and it seems unnecessary to me that we modify fields of
> MyLogicalRepWorker in pgstat.c. Although this function has “report”
> in its name but it just increments the counter. I think we can do that in worker.c.
Fixed.


Also, I made the timing adjustment logic
back and now have the independent one as Amit-san suggested in [1].

Kindly have a look at v24.


[1] - https://www.postgresql.org/message-id/CAA4eK1LWYc15%3DASj1tMTEFsXtxu%3D02aGoMwq9YanUVr9-QMhdQ%40mail.gmail.com


Best Regards,
    Takamichi Osumi


Вложения

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Wednesday, February 23, 2022 3:30 PM Amit Kapila <amit.kapila16@gmail.com>
> On Tue, Feb 22, 2022 at 6:45 AM tanghy.fnst@fujitsu.com
> <tanghy.fnst@fujitsu.com> wrote:
> >
> > I found a problem when using it. When a replication workers exits, the
> > transaction stats should be sent to stats collector if they were not
> > sent before because it didn't reach PGSTAT_STAT_INTERVAL. But I saw
> > that the stats weren't updated as expected.
> >
> > I looked into it and found that the replication worker would send the
> > transaction stats (if any) before it exits. But it got invalid subid
> > in pgstat_send_subworker_xact_stats(), which led to the following result:
> >
> > postgres=# select pg_stat_get_subscription_worker(0, null);
> > pg_stat_get_subscription_worker
> > ---------------------------------
> >  (0,,2,0,0,,,,0,"",)
> > (1 row)
> >
> > I think that's because subid has already been cleaned when trying to
> > send the stats. I printed the value of before_shmem_exit_list, the
> > functions in this list would be called in shmem_exit() when the worker exits.
> > logicalrep_worker_onexit() would clean up the worker info (including
> > subid), and
> > pgstat_shutdown_hook() would send stats if any.
> > logicalrep_worker_onexit() was called before calling
> pgstat_shutdown_hook().
> >
> 
> Yeah, I think that is a problem and maybe we can think of solving it by sending
> the stats via logicalrep_worker_onexit before subid is cleared but not sure that
> is a good idea. I feel we need to go back to the idea of v21 for sending stats
> instead of using pgstat_report_stat.
> I think the one thing which we could improve is to avoid trying to send it each
> time before receiving each message by walrcv_receive and rather try to send it
> before we try to wait (WaitLatchOrSocket).
> Trying after each message doesn't seem to be required and could lead to some
> overhead as well. What do you think?
I agree. Fixed.

Kindly have a look at v24 shared in [1].

[1] -
https://www.postgresql.org/message-id/TYCPR01MB8373A3E1BE237BAF38185BF2ED3D9%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
    Takamichi Osumi


RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Friday, February 25, 2022 7:58 AM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote:
> Kindly have a look at v24.
Hi.

The recent commit(7a85073) has redesigned the view pg_stat_subscription_workers
and now we have pg_stat_subscription_stats. Therefore, I rebased my patch
so that my statistics patch can be applied on top of the HEAD.

In the process of this rebase, I had to drop one column
that stored error count for unique errors(which was
incremented after confirming the error is not the same as previous
one), because the commit tentatively removes the same error check
mechanism.

Therefore, this patch has apply_commit_count, apply_rollback_count only.
I slightly changed minor changes as well so that those can
become more aligned.

Kindly please have a look at the patch.

Best Regards,
    Takamichi Osumi


Вложения

Re: Failed transaction statistics to measure the logical replication progress

От
Peter Smith
Дата:
Please see below my review comments for v25.

======

1. Commit message

Introduce cumulative columns of transactions of
logical replication subscriber to the pg_stat_subscription_stats view.

"cumulative columns of transactions" sounds a bit strange to me.

SUGGESTED
Introduce 2 new subscription statistics columns (apply_commit_count,
and apply_rollback_count) to the pg_stat_subscription_stats view for
counting cumulative transaction commits/rollbacks.

~~~

2. doc/src/sgml/monitoring.sgml - bug

The new SGML <row>s have been added in the wrong place!

I don't think this renders like you expect it does. Please regenerate
the help to see for yourself.

~~~

3. doc/src/sgml/monitoring.sgml - wording

+      <para>
+       Number of transactions rollbacked in this subscription. Both
+       <command>ROLLBACK</command> of transaction streamed as in-progress
+       transaction and <command>ROLLBACK PREPARED</command> increment this
+       counter.
+      </para></entry>

BEFORE
Number of transactions rollbacked in this subscription.

SUGGESTED
Number of transaction rollbacks in this subscription.

~~~

4. doc/src/sgml/monitoring.sgml - wording

+      <para>
+       Number of transactions rollbacked in this subscription. Both
+       <command>ROLLBACK</command> of transaction streamed as in-progress
+       transaction and <command>ROLLBACK PREPARED</command> increment this
+       counter.
+      </para></entry>

Trying to distinguish between the ROLLBACK of a transaction and of a
streamed in-progress transaction seems to have made this description
too complicated. I don’t think the user even cares/knows about this
(in-progress) distinction. So, I think this should just be written
more simply (like the COMMIT part was)

BEFORE
Both  <command>ROLLBACK</command> of transaction streamed as
in-progress  transaction and <command>ROLLBACK PREPARED</command>
increment this counter.

SUGGESTED
Both  <command>ROLLBACK</command> and <command>ROLLBACK
PREPARED</command> increment this counter.

~~~

5. Question - column names.

Just curious why the columns are called "apply_commit_count" and
"apply_rollback_count"? Specifically, what extra meaning do those
names have versus just calling them "commit_count" and
"rollback_count"?

~~~

6. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact

@@ -3421,6 +3425,60 @@ pgstat_send_slru(void)
 }

 /* ----------
+ * pgstat_report_subscription_xact() -
+ *
+ * Send a subscription transaction stats to the collector.
+ * The statistics are cleared upon sending.
+ *
+ * 'force' is true only when the subscription worker process exits.
+ * ----------
+ */
+void
+pgstat_report_subscription_xact(bool force)

6a.
I think this comment should be worded more like the other
pgstat_report_subscption_XXX comments

BEFORE
Send a subscription transaction stats to the collector.

SUGGESTED
Tell the collector about subscriptions transaction stats.

6b.
+ * 'force' is true only when the subscription worker process exits.

I thought this comment should just describe what the 'force' param
actually does in this function; not the scenario about who calls it...

~~~

7. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact

I think the entire function maybe should be relocated to be nearby the
other pgstat_report_subscription_XXX functions in the source.

~~~

8. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact

+ /*
+ * This function can be called even if nothing at all has happened. In
+ * this case, there's no need to go forward.
+ */

Too much information. Clearly, it is possible for this function to be
called for this case otherwise this code would not exist in the first
place :)
IMO the comment can be much simpler but still say all it needs to.

BEFORE
This function can be called even if nothing at all has happened. In
this case, there's no need to go forward.
SUGGESTED
Bailout early if nothing to do.

~~~

9. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact

+ if (subStats.subid == InvalidOid ||
+ (subStats.apply_commit_count == 0 && subStats.apply_rollback_count == 0))
+ return;

Maybe using !OisIsValid(subStats.subid) is better?

~~~

10. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact

+ /*
+ * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
+ * msec since we last sent one to avoid overloading the stats
+ * collector.
+ */

SUGGESTED (2 sentences instead of 1)
Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
msec since we last sent one. This is to avoid overloading the stats
collector.

~~~

11. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact

+ if (!force)
+ {
+ TimestampTz now = GetCurrentTimestamp();
+
+ /*
+ * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
+ * msec since we last sent one to avoid overloading the stats
+ * collector.
+ */
+ if (!TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
+ return;
+ last_report = now;
+ }

(Yeah, I know there is similar code in this module but 2 wrongs do not
make a right)

I think logically it is better to put the 'now' and the 'last_report'
outside this if (!force) block. Otherwise, the forced report is not
setting the 'last_report' time and that just seems strange.

Rearranging this code is better IMO. e.g.
- the conditions are expressed positive instead of negative (!)
- only one return point instead of multiple
- the 'last_report' is always set so that strangeness is eliminated

SUGGESTED (it's the same code but rearranged)

TimestampTz now = GetCurrentTimestamp();

if (force || TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
{
/*
* Prepare and send the message.
*/
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_SUBSCRIPTIONXACT);
msg.m_databaseid = MyDatabaseId;
msg.m_subid = subStats.subid;
msg.apply_commit_count = subStats.apply_commit_count;
msg.apply_rollback_count = subStats.apply_rollback_count;
pgstat_send(&msg, sizeof(PgStat_MsgSubscriptionXact));
last_report = now;

/*
* Clear out the statistics.
*/
subStats.apply_commit_count = 0;
subStats.apply_rollback_count = 0;
}

~~~

12. src/backend/replication/logical/worker.c - LogicalRepSubscriptionStats

@@ -238,6 +238,8 @@ static ApplyErrorCallbackArg apply_error_callback_arg =
  .ts = 0,
 };

+LogicalRepSubscriptionStats subStats = {InvalidOid, 0, 0};

Maybe better to show explicit the member assignments (like the struct
above this one) to make it more clear.

~~~

13. src/backend/replication/logical/worker.c - subscription_stats_update

@@ -3372,6 +3386,22 @@ TwoPhaseTransactionGid(Oid subid, TransactionId
xid, char *gid, int szgid)
  snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
 }

+/*
+ * Update the statistics of subscription.
+ */
+static void
+subscription_stats_update(bool is_commit)
+{
+ Assert(OidIsValid(subStats.subid));
+
+ if (is_commit)
+ subStats.apply_commit_count++;
+ else
+ subStats.apply_rollback_count++;
+
+ pgstat_report_subscription_xact(false);
+}
+

I felt maybe this would be look better split into 2 functions: e.g.
subscription_stats_incr_commit() and
subscription_stats_incr_rollback(). Then it would be more readable
from all the callers instead of the vague looking
subscription_stats_update(true/false).

~~~

14. src/backend/replication/logical/worker.c - subscription_stats_update

+/*
+ * Update the statistics of subscription.
+ */
+static void
+subscription_stats_update(bool is_commit)
+{
+ Assert(OidIsValid(subStats.subid));
+
+ if (is_commit)
+ subStats.apply_commit_count++;
+ else
+ subStats.apply_rollback_count++;
+
+ pgstat_report_subscription_xact(false);
+}

Is it really necessary to be calling
pgstat_report_subscription_xactfrom here? That is already being called
in the LogicalRepApplyLoop. Isn't that enough?

~~~

15. src/backend/replication/logical/worker.c - LogicalRepApplyLoop

@@ -2717,6 +2729,8 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
  if (endofstream)
  break;

+ pgstat_report_subscription_xact(false);
+

Wondering if this call is better to be done a couple of lines up
(above the 'if/break'). Especially if you remove the call from the
subscription_stats_update as suggested in my review comment #14.

~~~

16. src/backend/utils/adt/pgstatfuncs.c - pg_stat_get_subscription_stats

@@ -2424,7 +2424,11 @@ pg_stat_get_subscription_stats(PG_FUNCTION_ARGS)
     INT8OID, -1, 0);
  TupleDescInitEntry(tupdesc, (AttrNumber) 3, "sync_error_count",
     INT8OID, -1, 0);
- TupleDescInitEntry(tupdesc, (AttrNumber) 4, "stats_reset",
+ TupleDescInitEntry(tupdesc, (AttrNumber) 4, "apply_commit_count",
+    INT8OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 5, "sync_rollback_count",
+    INT8OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 6, "stats_reset",
     TIMESTAMPTZOID, -1, 0);

Bug? What is "sync_rollback_count"? Looks like a cut/paste error.

~~~

17. src/include/pgstat.h

+typedef struct PgStat_MsgSubscriptionXact
+{
+ PgStat_MsgHdr m_hdr;
+
+ /* determine the worker entry */
+ Oid m_databaseid;
+ Oid m_subid;
+
+ PgStat_Counter apply_commit_count;
+ PgStat_Counter apply_rollback_count;
+} PgStat_MsgSubscriptionXact;

Is that m_databaseid even needed? I did not notice it getting used
(e.g. pgstat_recv_subscription_xact does not use it). Also, wasn't
similar removed from the other subscription error stats?

~~~

18. src/include/pgstat.h

+ PgStat_Counter apply_rollback_count;
+} PgStat_MsgSubscriptionXact;
+
+

The extra blank line can be removed.

~~~

19. src/include/pgstat.h

@@ -1177,6 +1201,8 @@ extern void pgstat_send_archiver(const char
*xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
 extern void pgstat_send_checkpointer(void);
 extern void pgstat_send_wal(bool force);
+extern void pgstat_report_subscription_xact(bool force);
+

 /* ----------

The extra blank line can be removed.

~~~

20. Test for the column names.

The patch added a couple of new columns to statistics so I was
surprised there were no regression test updates needed for these? How
can that be? Shouldn’t there be just one regression test that
validates the view column names are what they are expected to be?

------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Tuesday, March 1, 2022 4:12 PM Peter Smith <smithpb2250@gmail.com> wrote:
> Please see below my review comments for v25.
> 
> ======
> 
> 1. Commit message
> 
> Introduce cumulative columns of transactions of logical replication subscriber
> to the pg_stat_subscription_stats view.
> 
> "cumulative columns of transactions" sounds a bit strange to me.
> 
> SUGGESTED
> Introduce 2 new subscription statistics columns (apply_commit_count, and
> apply_rollback_count) to the pg_stat_subscription_stats view for counting
> cumulative transaction commits/rollbacks.
Fixed.



> ~~~
> 
> 2. doc/src/sgml/monitoring.sgml - bug
> 
> The new SGML <row>s have been added in the wrong place!
> 
> I don't think this renders like you expect it does. Please regenerate the help to
> see for yourself.
Fixed.


> ~~~
> 
> 3. doc/src/sgml/monitoring.sgml - wording
> 
> +      <para>
> +       Number of transactions rollbacked in this subscription. Both
> +       <command>ROLLBACK</command> of transaction streamed as
> in-progress
> +       transaction and <command>ROLLBACK PREPARED</command>
> increment this
> +       counter.
> +      </para></entry>
> 
> BEFORE
> Number of transactions rollbacked in this subscription.
> 
> SUGGESTED
> Number of transaction rollbacks in this subscription.
Fixed.


> ~~~
> 
> 4. doc/src/sgml/monitoring.sgml - wording
> 
> +      <para>
> +       Number of transactions rollbacked in this subscription. Both
> +       <command>ROLLBACK</command> of transaction streamed as
> in-progress
> +       transaction and <command>ROLLBACK PREPARED</command>
> increment this
> +       counter.
> +      </para></entry>
> 
> Trying to distinguish between the ROLLBACK of a transaction and of a
> streamed in-progress transaction seems to have made this description too
> complicated. I don't think the user even cares/knows about this
> (in-progress) distinction. So, I think this should just be written more simply
> (like the COMMIT part was)
> 
> BEFORE
> Both  <command>ROLLBACK</command> of transaction streamed as
> in-progress  transaction and <command>ROLLBACK
> PREPARED</command> increment this counter.
> 
> SUGGESTED
> Both  <command>ROLLBACK</command> and <command>ROLLBACK
> PREPARED</command> increment this counter.
Fixed.


> ~~~
> 
> 5. Question - column names.
> 
> Just curious why the columns are called "apply_commit_count" and
> "apply_rollback_count"? Specifically, what extra meaning do those names have
> versus just calling them "commit_count" and "rollback_count"?
I think there's possibility that we'll have counters
for tablesync commit for example. So, the name prefix avoids
the overlap between the possible names.



> ~~~
> 
> 6. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact
> 
> @@ -3421,6 +3425,60 @@ pgstat_send_slru(void)  }
> 
>  /* ----------
> + * pgstat_report_subscription_xact() -
> + *
> + * Send a subscription transaction stats to the collector.
> + * The statistics are cleared upon sending.
> + *
> + * 'force' is true only when the subscription worker process exits.
> + * ----------
> + */
> +void
> +pgstat_report_subscription_xact(bool force)
> 
> 6a.
> I think this comment should be worded more like the other
> pgstat_report_subscption_XXX comments
> 
> BEFORE
> Send a subscription transaction stats to the collector.
> 
> SUGGESTED
> Tell the collector about subscriptions transaction stats.
Fixed.



> 6b.
> + * 'force' is true only when the subscription worker process exits.
> 
> I thought this comment should just describe what the 'force' param actually
> does in this function; not the scenario about who calls it...
Fixed.



> ~~~
> 
> 7. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact
> 
> I think the entire function maybe should be relocated to be nearby the other
> pgstat_report_subscription_XXX functions in the source.
I placed the pgstat_report_subscription_xact below pgstat_report_subscription_drop.
Meanwhile, pgstat_recv_subscription_xact, another new function in pgstat.c,
is already placed below pgstat_recv_subscription_error, so I kept it as it is.



> ~~~
> 
> 8. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact
> 
> + /*
> + * This function can be called even if nothing at all has happened. In
> + * this case, there's no need to go forward.
> + */
> 
> Too much information. Clearly, it is possible for this function to be called for this
> case otherwise this code would not exist in the first place :) IMO the comment
> can be much simpler but still say all it needs to.
> 
> BEFORE
> This function can be called even if nothing at all has happened. In this case,
> there's no need to go forward.
> SUGGESTED
> Bailout early if nothing to do.
Fixed.


> ~~~
> 
> 9. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact
> 
> + if (subStats.subid == InvalidOid ||
> + (subStats.apply_commit_count == 0 && subStats.apply_rollback_count ==
> + 0)) return;
> 
> Maybe using !OisIsValid(subStats.subid) is better?
Fixed.


> ~~~
> 
> 10. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact
> 
> + /*
> + * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
> + * msec since we last sent one to avoid overloading the stats
> + * collector.
> + */
> 
> SUGGESTED (2 sentences instead of 1)
> Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
> msec since we last sent one. This is to avoid overloading the stats collector.
Fixed.



> ~~~
> 
> 11. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact
> 
> + if (!force)
> + {
> + TimestampTz now = GetCurrentTimestamp();
> +
> + /*
> + * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
> + * msec since we last sent one to avoid overloading the stats
> + * collector.
> + */
> + if (!TimestampDifferenceExceeds(last_report, now,
> + PGSTAT_STAT_INTERVAL)) return; last_report = now; }
> 
> (Yeah, I know there is similar code in this module but 2 wrongs do not make a
> right)
> 
> I think logically it is better to put the 'now' and the 'last_report'
> outside this if (!force) block. Otherwise, the forced report is not setting the
> 'last_report' time and that just seems strange.
> 
> Rearranging this code is better IMO. e.g.
> - the conditions are expressed positive instead of negative (!)
> - only one return point instead of multiple
> - the 'last_report' is always set so that strangeness is eliminated
> 
> SUGGESTED (it's the same code but rearranged)
> 
> TimestampTz now = GetCurrentTimestamp();
> 
> if (force || TimestampDifferenceExceeds(last_report, now,
> PGSTAT_STAT_INTERVAL)) {
> /*
> * Prepare and send the message.
> */
> pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_SUBSCRIPTIONXACT);
> msg.m_databaseid = MyDatabaseId; msg.m_subid = subStats.subid;
> msg.apply_commit_count = subStats.apply_commit_count;
> msg.apply_rollback_count = subStats.apply_rollback_count;
> pgstat_send(&msg, sizeof(PgStat_MsgSubscriptionXact));
> last_report = now;
> 
> /*
> * Clear out the statistics.
> */
> subStats.apply_commit_count = 0;
> subStats.apply_rollback_count = 0;
> }
Yeah, your suggestion looks tidy.
Yet, I wasn't sure if I should set the 'last_report' for exit case,
since we don't use it after the worker exit.
In addition, we need to calculate GetCurrentTimestamp()
even in the case 'force' is set to true.
I'm not sure if that is correct.

So, I'd like to keep it as it as at this stage.


> ~~~
> 
> 12. src/backend/replication/logical/worker.c - LogicalRepSubscriptionStats
> 
> @@ -238,6 +238,8 @@ static ApplyErrorCallbackArg apply_error_callback_arg
> =
>   .ts = 0,
>  };
> 
> +LogicalRepSubscriptionStats subStats = {InvalidOid, 0, 0};
> 
> Maybe better to show explicit the member assignments (like the struct above
> this one) to make it more clear.
True. Fixed.


> ~~~
> 
> 13. src/backend/replication/logical/worker.c - subscription_stats_update
> 
> @@ -3372,6 +3386,22 @@ TwoPhaseTransactionGid(Oid subid, TransactionId
> xid, char *gid, int szgid)
>   snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);  }
> 
> +/*
> + * Update the statistics of subscription.
> + */
> +static void
> +subscription_stats_update(bool is_commit) {
> +Assert(OidIsValid(subStats.subid));
> +
> + if (is_commit)
> + subStats.apply_commit_count++;
> + else
> + subStats.apply_rollback_count++;
> +
> + pgstat_report_subscription_xact(false);
> +}
> +
> 
> I felt maybe this would be look better split into 2 functions: e.g.
> subscription_stats_incr_commit() and
> subscription_stats_incr_rollback(). Then it would be more readable from all
> the callers instead of the vague looking subscription_stats_update(true/false).
Okay. Fixed.

Probably, I suppose the ideal solution here would be probably to come up
with a good name for one unified function that explains the internal processing by itself.
I spent some time to try to create a new good name (e.g. "subscription_committed_stats_update")
but all ideas weren't good. Then, I decided to make them separate.


> ~~~
> 
> 14. src/backend/replication/logical/worker.c - subscription_stats_update
> 
> +/*
> + * Update the statistics of subscription.
> + */
> +static void
> +subscription_stats_update(bool is_commit) {
> +Assert(OidIsValid(subStats.subid));
> +
> + if (is_commit)
> + subStats.apply_commit_count++;
> + else
> + subStats.apply_rollback_count++;
> +
> + pgstat_report_subscription_xact(false);
> +}
> 
> Is it really necessary to be calling
> pgstat_report_subscription_xactfrom here? That is already being called in the
> LogicalRepApplyLoop. Isn't that enough?
Deleted.


> ~~~
> 
> 15. src/backend/replication/logical/worker.c - LogicalRepApplyLoop
> 
> @@ -2717,6 +2729,8 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
>   if (endofstream)
>   break;
> 
> + pgstat_report_subscription_xact(false);
> +
> 
> Wondering if this call is better to be done a couple of lines up (above the
> 'if/break'). Especially if you remove the call from the subscription_stats_update
> as suggested in my review comment #14.
I feel this should be after the condition.

I checked with my debugger that after the break of LogicalRepApplyLoop
(when endofstream equals true), we'd call proc_exit. This leads to 
the call pgstat_report_subscription_xact in the logicalrep_worker_exit.

In this function, we use it with 'true' argument.
So, if we call the pgstat_report_subscription_xact before the endofstream check,
we have to call the function twice in the end.

Therefore, the current position makes sense to me.


> ~~~
> 
> 16. src/backend/utils/adt/pgstatfuncs.c - pg_stat_get_subscription_stats
> 
> @@ -2424,7 +2424,11 @@
> pg_stat_get_subscription_stats(PG_FUNCTION_ARGS)
>      INT8OID, -1, 0);
>   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "sync_error_count",
>      INT8OID, -1, 0);
> - TupleDescInitEntry(tupdesc, (AttrNumber) 4, "stats_reset",
> + TupleDescInitEntry(tupdesc, (AttrNumber) 4, "apply_commit_count",
> +    INT8OID, -1, 0);
> + TupleDescInitEntry(tupdesc, (AttrNumber) 5, "sync_rollback_count",
> +    INT8OID, -1, 0);
> + TupleDescInitEntry(tupdesc, (AttrNumber) 6, "stats_reset",
>      TIMESTAMPTZOID, -1, 0);
> 
> Bug? What is "sync_rollback_count"? Looks like a cut/paste error.
Fixed.


> ~~~
> 
> 17. src/include/pgstat.h
> 
> +typedef struct PgStat_MsgSubscriptionXact {  PgStat_MsgHdr m_hdr;
> +
> + /* determine the worker entry */
> + Oid m_databaseid;
> + Oid m_subid;
> +
> + PgStat_Counter apply_commit_count;
> + PgStat_Counter apply_rollback_count;
> +} PgStat_MsgSubscriptionXact;
> 
> Is that m_databaseid even needed? I did not notice it getting used (e.g.
> pgstat_recv_subscription_xact does not use it). Also, wasn't similar removed
> from the other subscription error stats?
Fixed.



> ~~~
> 
> 18. src/include/pgstat.h
> 
> + PgStat_Counter apply_rollback_count;
> +} PgStat_MsgSubscriptionXact;
> +
> +
> 
> The extra blank line can be removed.
Fixed.


> ~~~
> 
> 19. src/include/pgstat.h
> 
> @@ -1177,6 +1201,8 @@ extern void pgstat_send_archiver(const char *xlog,
> bool failed);  extern void pgstat_send_bgwriter(void);  extern void
> pgstat_send_checkpointer(void);  extern void pgstat_send_wal(bool force);
> +extern void pgstat_report_subscription_xact(bool force);
> +
> 
>  /* ----------
> 
> The extra blank line can be removed.
Fixed.


> ~~~
> 
> 20. Test for the column names.
> 
> The patch added a couple of new columns to statistics so I was surprised there
> were no regression test updates needed for these? How can that be?
> Shouldn't there be just one regression test that validates the view column
> names are what they are expected to be?
In my earlier versions, I had some tests
that covered major types of transactions (e.g. stream commit,
stream abort, rollback prepared...) and waited for accurate update of new counters.
But, I dropped those after an advice the result of such kind of tests
wouldn't become stable fundamentally.

Also, I quickly checked other similar views(pg_stat_slru, pg_stat_wal_receiver)
commit logs, especially when they introduce columns.
But, I couldn't find column name validations.
So, I feel this is aligned.


Best Regards,
    Takamichi Osumi


Вложения

Re: Failed transaction statistics to measure the logical replication progress

От
Masahiko Sawada
Дата:
On Wed, Mar 2, 2022 at 10:21 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
>
> Also, I quickly checked other similar views(pg_stat_slru, pg_stat_wal_receiver)
> commit logs, especially when they introduce columns.
> But, I couldn't find column name validations.
> So, I feel this is aligned.
>

I've looked at v26 patch and here are some random comments:

+        /* determine the subscription entry */
+        Oid                    m_subid;
+
+        PgStat_Counter apply_commit_count;
+        PgStat_Counter apply_rollback_count;

I think it's better to add the prefix "m_" to
apply_commit/rollback_count for consistency.

---
+/*
+ * Increment the counter of commit for subscription statistics.
+ */
+static void
+subscription_stats_incr_commit(void)
+{
+        Assert(OidIsValid(subStats.subid));
+
+        subStats.apply_commit_count++;
+}
+

I think we don't need the Assert() here since it should not be a
problem even if subStats.subid is InvalidOid at least in this
function.

If we remove it, we can remove both subscription_stats_incr_commit()
and +subscription_stats_incr_rollback() as well.

---
+void
+pgstat_report_subscription_xact(bool force)
+{
+        static TimestampTz last_report = 0;
+        PgStat_MsgSubscriptionXact msg;
+
+        /* Bailout early if nothing to do */
+        if (!OidIsValid(subStats.subid) ||
+                (subStats.apply_commit_count == 0 &&
subStats.apply_rollback_count == 0))
+                return;
+

+LogicalRepSubscriptionStats subStats =
+{
+        .subid = InvalidOid,
+        .apply_commit_count = 0,
+        .apply_rollback_count = 0,
+};

Do we need subStats.subid? I think we can pass MySubscription->oid (or
MyLogicalRepWorker->subid) to pgstat_report_subscription_xact() along
with the pointer of the statistics (subStats). That way, we don't need
to expose subStats.

Also, I think it's better to add "Xact" or something to the struct
name. For example, SubscriptionXactStats.

---
+
+typedef struct LogicalRepSubscriptionStats
+{
+        Oid                    subid;
+
+        int64          apply_commit_count;
+        int64          apply_rollback_count;
+} LogicalRepSubscriptionStats;

We need a description for this struct.

Probably it is better to declare it in logicalworker.h instead so that
pgstat.c includes it instead of worker_internal.h? worker_internal.h
is the header file shared by logical replication workers such as apply
worker,  tablesync worker, and launcher. So it might not be advisable
to include it in pgstat.c.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Failed transaction statistics to measure the logical replication progress

От
"shiy.fnst@fujitsu.com"
Дата:
Hi,

A comments on the v26 patch.

The following document about pg_stat_subscription_stats view only says that
"showing statistics about errors", should we add something about transactions
here? 

     <row>

<entry><structname>pg_stat_subscription_stats</structname><indexterm><primary>pg_stat_subscription_stats</primary></indexterm></entry>
      <entry>One row per subscription, showing statistics about errors.
      See <link linkend="monitoring-pg-stat-subscription-stats">
      <structname>pg_stat_subscription_stats</structname></link> for details.
      </entry>
     </row>


I noticed that the v24 patch has some changes about the description of this
view. Maybe we can modify to "showing statistics about errors and transactions".

Regards,
Shi yu

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Wednesday, March 2, 2022 2:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Mar 2, 2022 at 10:21 AM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> > Also, I quickly checked other similar views(pg_stat_slru,
> > pg_stat_wal_receiver) commit logs, especially when they introduce columns.
> > But, I couldn't find column name validations.
> > So, I feel this is aligned.
> >
> 
> I've looked at v26 patch and here are some random comments:
Hi, thank you for reviewing !


> +        /* determine the subscription entry */
> +        Oid                    m_subid;
> +
> +        PgStat_Counter apply_commit_count;
> +        PgStat_Counter apply_rollback_count;
> 
> I think it's better to add the prefix "m_" to apply_commit/rollback_count for
> consistency.
Fixed.

 
> ---
> +/*
> + * Increment the counter of commit for subscription statistics.
> + */
> +static void
> +subscription_stats_incr_commit(void)
> +{
> +        Assert(OidIsValid(subStats.subid));
> +
> +        subStats.apply_commit_count++;
> +}
> +
> 
> I think we don't need the Assert() here since it should not be a problem even if
> subStats.subid is InvalidOid at least in this function.
> 
> If we remove it, we can remove both subscription_stats_incr_commit() and
> +subscription_stats_incr_rollback() as well.
Removed the Assert() from both functions.


> ---
> +void
> +pgstat_report_subscription_xact(bool force) {
> +        static TimestampTz last_report = 0;
> +        PgStat_MsgSubscriptionXact msg;
> +
> +        /* Bailout early if nothing to do */
> +        if (!OidIsValid(subStats.subid) ||
> +                (subStats.apply_commit_count == 0 &&
> subStats.apply_rollback_count == 0))
> +                return;
> +
> 
> +LogicalRepSubscriptionStats subStats =
> +{
> +        .subid = InvalidOid,
> +        .apply_commit_count = 0,
> +        .apply_rollback_count = 0,
> +};
> 
> Do we need subStats.subid? I think we can pass MySubscription->oid (or
> MyLogicalRepWorker->subid) to pgstat_report_subscription_xact() along
> with the pointer of the statistics (subStats). That way, we don't need to expose
> subStats.
Removed the subStats.subid. Also, now I pass the oid to the
pgstat_report_subscription_xact with the pointer of the statistics.

> Also, I think it's better to add "Xact" or something to the struct name. For
> example, SubscriptionXactStats.
Renamed.

> ---
> +
> +typedef struct LogicalRepSubscriptionStats {
> +        Oid                    subid;
> +
> +        int64          apply_commit_count;
> +        int64          apply_rollback_count;
> +} LogicalRepSubscriptionStats;
> 
> We need a description for this struct.
> 
> Probably it is better to declare it in logicalworker.h instead so that pgstat.c
> includes it instead of worker_internal.h? worker_internal.h is the header file
> shared by logical replication workers such as apply worker,  tablesync worker,
> and launcher. So it might not be advisable to include it in pgstat.c.
Changed the definition place to logicalworker.h
and added some explanations for it.

Attached the updated v27.


Best Regards,
    Takamichi Osumi


Вложения

RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
On Wednesday, March 2, 2022 5:29 PM Shi, Yu/侍 雨 <shiy.fnst@fujitsu.com> wrote:
> A comments on the v26 patch.
Thank you for checking the patch !

> 
> The following document about pg_stat_subscription_stats view only says that
> "showing statistics about errors", should we add something about transactions
> here?
> 
>      <row>
> 
> <entry><structname>pg_stat_subscription_stats</structname><indexterm
> ><primary>pg_stat_subscription_stats</primary></indexterm></entry>
>       <entry>One row per subscription, showing statistics about errors.
>       See <link linkend="monitoring-pg-stat-subscription-stats">
>       <structname>pg_stat_subscription_stats</structname></link> for
> details.
>       </entry>
>      </row>
> 
> 
> I noticed that the v24 patch has some changes about the description of this
> view. Maybe we can modify to "showing statistics about errors and
> transactions".
You are right. Fixed.

New patch v27 that incorporated your comments is shared in [1].
Kindly have a look at it.


[1] -
https://www.postgresql.org/message-id/TYCPR01MB837345FB72E3CE9C827AF42CED049%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
    Takamichi Osumi


Re: Failed transaction statistics to measure the logical replication progress

От
Amit Kapila
Дата:
On Thu, Mar 3, 2022 at 8:58 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>

This patch introduces two new subscription statistics columns
(apply_commit_count and apply_rollback_count) to the
pg_stat_subscription_stats view for counting cumulative transactions
commits/rollbacks for a particular subscription. Now, users can
already see the total number of xacts committed/rolled back in a
particular database via pg_stat_database, so this can be considered
duplicate information. OTOH, some users might be interested in the
stats for a subscription to know how many transactions are
successfully applied during replication because the information in
pg_stat_database also includes the operations that happened on the
node.

I am not sure if it is worth adding this additional information or how
useful it will be for users. Does anyone else want to weigh in on
this?

If nobody else sees value in this then I feel it is better to mark
this patch as Returned with feedback or Rejected. We can come back to
it later if we see more demand for this.

-- 
With Regards,
Amit Kapila.



Re: Failed transaction statistics to measure the logical replication progress

От
Masahiko Sawada
Дата:
On Thu, Mar 24, 2022 at 12:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Mar 3, 2022 at 8:58 AM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
>
> This patch introduces two new subscription statistics columns
> (apply_commit_count and apply_rollback_count) to the
> pg_stat_subscription_stats view for counting cumulative transactions
> commits/rollbacks for a particular subscription. Now, users can
> already see the total number of xacts committed/rolled back in a
> particular database via pg_stat_database, so this can be considered
> duplicate information.

Right.

> OTOH, some users might be interested in the
> stats for a subscription to know how many transactions are
> successfully applied during replication because the information in
> pg_stat_database also includes the operations that happened on the
> node.

I'm not sure how useful this information is in practice. What can we
use this information for?

IIRC the original purpose of this proposed feature is to provide a way
for the users to understand the size and count of the succeeded and
failed transactions. At some point, the patch includes the statistics
of only the counts of commits, rollbacks, and errors. If new
statistics also include the size, it might be useful to achieve the
original goal. But I’m concerned that adding only apply_commit_count
and apply_rollback_count ends up adding the duplicate statistics with
no concrete use cases.

> I am not sure if it is worth adding this additional information or how
> useful it will be for users. Does anyone else want to weigh in on
> this?
>
> If nobody else sees value in this then I feel it is better to mark
> this patch as Returned with feedback or Rejected. We can come back to
> it later if we see more demand for this.

Marking as Returned with feedback makes sense to me.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Failed transaction statistics to measure the logical replication progress

От
"osumi.takamichi@fujitsu.com"
Дата:
Hi

On Friday, March 25, 2022 2:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Thu, Mar 24, 2022 at 12:30 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Thu, Mar 3, 2022 at 8:58 AM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> >
> > This patch introduces two new subscription statistics columns
> > (apply_commit_count and apply_rollback_count) to the
> > pg_stat_subscription_stats view for counting cumulative transactions
> > commits/rollbacks for a particular subscription. Now, users can
> > already see the total number of xacts committed/rolled back in a
> > particular database via pg_stat_database, so this can be considered
> > duplicate information.
> 
> Right.
...
> > I am not sure if it is worth adding this additional information or how
> > useful it will be for users. Does anyone else want to weigh in on
> > this?
> >
> > If nobody else sees value in this then I feel it is better to mark
> > this patch as Returned with feedback or Rejected. We can come back to
> > it later if we see more demand for this.
> 
> Marking as Returned with feedback makes sense to me.
OK. Thank you so much for sharing your opinions, Sawada-san and Amit-san.

I changed the status of this entry to "Returned with feedback" accordingly.



Best Regards,
    Takamichi Osumi