Discussion: Logical replication timeout problem


Logical replication timeout problem

From
Fabrice Chapuis
Date:
Hi,

Logical replication is configured on one instance running version 10.18. Timeout errors occur regularly and the worker process exits with exit code 1:

2021-09-16 12:06:50 CEST [24881]: [1-1] user=postgres,db=foo,client=[local] LOG:  duration: 1281408.171 ms  statement: COPY schem.tab (col1, col2) FROM stdin;
2021-09-16 12:07:11 CEST [12161]: [1-1] user=,db=,client= LOG:  automatic analyze of table "foo.schem.tab" system usage: CPU: user: 4.13 s, system: 0.55 s, elapsed: 9.58 s
2021-09-16 12:07:50 CEST [3770]: [2-1] user=,db=,client= ERROR:  terminating logical replication worker due to timeout
2021-09-16 12:07:50 CEST [12546]: [11-1] user=,db=,client= LOG:  worker process: logical replication worker for subscription 24106654 (PID 3770) exited with exit code 1
2021-09-16 12:07:50 CEST [13872]: [1-1] user=,db=,client= LOG:  logical replication apply worker for subscription "subxxxx" has started
2021-09-16 12:07:50 CEST [13873]: [1-1] user=repuser,db=foo,client=127.0.0.1 LOG:  received replication command: IDENTIFY_SYSTEM

Why does this happen?

Thanks a lot for your help

Fabrice

Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Fri, Sep 17, 2021 at 3:29 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> Hi,
>
> Logical replication is configured on one instance in version 10.18. Timeout errors occur regularly and the worker process exit with an exit code 1
>
> 2021-09-16 12:06:50 CEST [24881]: [1-1] user=postgres,db=foo,client=[local] LOG:  duration: 1281408.171 ms  statement: COPY schem.tab (col1, col2) FROM stdin;
> 2021-09-16 12:07:11 CEST [12161]: [1-1] user=,db=,client= LOG:  automatic analyze of table "foo.schem.tab" system usage: CPU: user: 4.13 s, system: 0.55 s, elapsed: 9.58 s
> 2021-09-16 12:07:50 CEST [3770]: [2-1] user=,db=,client= ERROR:  terminating logical replication worker due to timeout
> 2021-09-16 12:07:50 CEST [12546]: [11-1] user=,db=,client= LOG:  worker process: logical replication worker for subscription 24106654 (PID 3770) exited with exit code 1
> 2021-09-16 12:07:50 CEST [13872]: [1-1] user=,db=,client= LOG:  logical replication apply worker for subscription "subxxxx" has started
> 2021-09-16 12:07:50 CEST [13873]: [1-1] user=repuser,db=foo,client=127.0.0.1 LOG:  received replication command: IDENTIFY_SYSTEM
>

Can you share the publisher-side log as well?


-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
The publisher and the subscriber run on the same Postgres instance.

Regards,
Fabrice

On Fri, Sep 17, 2021 at 12:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Sep 17, 2021 at 3:29 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> Hi,
>
> Logical replication is configured on one instance in version 10.18. Timeout errors occur regularly and the worker process exit with an exit code 1
>
> 2021-09-16 12:06:50 CEST [24881]: [1-1] user=postgres,db=foo,client=[local] LOG:  duration: 1281408.171 ms  statement: COPY schem.tab (col1, col2) FROM stdin;
> 2021-09-16 12:07:11 CEST [12161]: [1-1] user=,db=,client= LOG:  automatic analyze of table "foo.schem.tab" system usage: CPU: user: 4.13 s, system: 0.55 s, elapsed: 9.58 s
> 2021-09-16 12:07:50 CEST [3770]: [2-1] user=,db=,client= ERROR:  terminating logical replication worker due to timeout
> 2021-09-16 12:07:50 CEST [12546]: [11-1] user=,db=,client= LOG:  worker process: logical replication worker for subscription 24106654 (PID 3770) exited with exit code 1
> 2021-09-16 12:07:50 CEST [13872]: [1-1] user=,db=,client= LOG:  logical replication apply worker for subscription "subxxxx" has started
> 2021-09-16 12:07:50 CEST [13873]: [1-1] user=repuser,db=foo,client=127.0.0.1 LOG:  received replication command: IDENTIFY_SYSTEM
>

Can you share the publisher-side log as well?


--
With Regards,
Amit Kapila.

Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Fri, Sep 17, 2021 at 8:08 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> the publisher and the subscriber run on the same postgres instance.
>

Okay, but there is no log corresponding to operations being performed
by the publisher. By looking at current logs it is not very clear to
me what might have caused this. Did you try increasing
wal_sender_timeout and wal_receiver_timeout?
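If you want to try that, something along these lines raises both timeouts cluster-wide; the 5-minute value is only an example, and both settings can be reloaded without a restart:

ALTER SYSTEM SET wal_sender_timeout = '5min';
ALTER SYSTEM SET wal_receiver_timeout = '5min';
SELECT pg_reload_conf();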

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
Hi Amit, 

We can reproduce the problem: we load a table of several GB into the schema of the publisher, and this triggers the worker's timeout one minute after the end of this load. The table on which this load is executed is not replicated.
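For reference, the load is essentially one large COPY into a non-replicated table; something equivalent can be generated like this (table name and row count are only illustrative):

BEGIN;
CREATE TABLE IF NOT EXISTS schem.big_load (col1 int, col2 text);
INSERT INTO schem.big_load
  SELECT i, md5(i::text) FROM generate_series(1, 10000000) AS i;
COMMIT;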

2021-09-16 12:06:50 CEST [24881]: [1-1] user=postgres,db=db012a00,client=[local] LOG:  duration: 1281408.171 ms  statement: COPY db.table (col1, col2) FROM stdin;

2021-09-16 12:07:11 CEST [12161]: [1-1] user=,db=,client= LOG:  automatic analyze of table "db.table " system usage: CPU: user: 4.13 s, system: 0.55 s, elapsed: 9.58 s

2021-09-16 12:07:50 CEST [3770]: [2-1] user=,db=,client= ERROR:  terminating logical replication worker due to timeout

Before increasing the values of wal_sender_timeout and wal_receiver_timeout, I would like to investigate further the mechanisms leading to this timeout.

Thanks for your help

Fabrice



On Sun, Sep 19, 2021 at 6:25 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Sep 17, 2021 at 8:08 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> the publisher and the subscriber run on the same postgres instance.
>

Okay, but there is no log corresponding to operations being performed
by the publisher. By looking at current logs it is not very clear to
me what might have caused this. Did you try increasing
wal_sender_timeout and wal_receiver_timeout?

--
With Regards,
Amit Kapila.

Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Mon, Sep 20, 2021 at 4:10 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> Hi Amit,
>
> We can replay the problem: we load a table of several Gb in the schema of the publisher, this generates the worker's timeout after one minute from the end of this load. The table on which this load is executed is not replicated.
>
> 2021-09-16 12:06:50 CEST [24881]: [1-1] user=postgres,db=db012a00,client=[local] LOG:  duration: 1281408.171 ms  statement: COPY db.table (col1, col2) FROM stdin;
>
> 2021-09-16 12:07:11 CEST [12161]: [1-1] user=,db=,client= LOG:  automatic analyze of table "db.table " system usage: CPU: user: 4.13 s, system: 0.55 s, elapsed: 9.58 s
>
> 2021-09-16 12:07:50 CEST [3770]: [2-1] user=,db=,client= ERROR:  terminating logical replication worker due to timeout
>
> Before increasing value for wal_sender_timeout and wal_receiver_timeout I thought to further investigate the mechanisms leading to this timeout.
>

The basic problem here seems to be that WAL Sender is not able to send
a keepalive or any other message for the configured
wal_receiver_timeout. I am not sure how that can happen but can you
once try by switching autovacuum = off? I wanted to ensure that
WALSender is not blocked due to the background process autovacuum.
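For the test, it can be turned off temporarily with something like the following (and switched back on the same way afterwards):

ALTER SYSTEM SET autovacuum = off;
SELECT pg_reload_conf();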

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Mon, Sep 20, 2021 at 5:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 20, 2021 at 4:10 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
> >
> > Hi Amit,
> >
> > We can replay the problem: we load a table of several Gb in the schema of the publisher, this generates the worker's timeout after one minute from the end of this load. The table on which this load is executed is not replicated.
> >
> > 2021-09-16 12:06:50 CEST [24881]: [1-1] user=postgres,db=db012a00,client=[local] LOG:  duration: 1281408.171 ms  statement: COPY db.table (col1, col2) FROM stdin;
> >
> > 2021-09-16 12:07:11 CEST [12161]: [1-1] user=,db=,client= LOG:  automatic analyze of table "db.table " system usage: CPU: user: 4.13 s, system: 0.55 s, elapsed: 9.58 s
> >
> > 2021-09-16 12:07:50 CEST [3770]: [2-1] user=,db=,client= ERROR:  terminating logical replication worker due to timeout
> >
> > Before increasing value for wal_sender_timeout and wal_receiver_timeout I thought to further investigate the mechanisms leading to this timeout.
> >
>
> The basic problem here seems to be that WAL Sender is not able to send
> a keepalive or any other message for the configured
> wal_receiver_timeout. I am not sure how that can happen but can you
> once try by switching autovacuum = off? I wanted to ensure that
> WALSender is not blocked due to the background process autovacuum.
>

The other thing we can try out is to check the data in pg_locks on
publisher during one minute after the large copy is finished. This we
can try out both with and without autovacuum.
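For example, a query along these lines would show which processes hold or wait for locks during that window (the join to pg_stat_activity is only for readability):

SELECT a.pid, a.backend_type, l.locktype, l.mode, l.granted, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
ORDER BY a.pid;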

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
With the autovacuum parameter set to off, the problem did not occur right after loading the table as in our previous tests. However, the timeout occurred later. We have seen .snap files accumulate to several GB.

...
-rw-------. 1 postgres postgres 16791226 Sep 20 15:26 xid-1238444701-lsn-2D2B-F5000000.snap
-rw-------. 1 postgres postgres 16973268 Sep 20 15:26 xid-1238444701-lsn-2D2B-F6000000.snap
-rw-------. 1 postgres postgres 16790984 Sep 20 15:26 xid-1238444701-lsn-2D2B-F7000000.snap
-rw-------. 1 postgres postgres 16988112 Sep 20 15:26 xid-1238444701-lsn-2D2B-F8000000.snap
-rw-------. 1 postgres postgres 16864593 Sep 20 15:26 xid-1238444701-lsn-2D2B-F9000000.snap
-rw-------. 1 postgres postgres 16902167 Sep 20 15:26 xid-1238444701-lsn-2D2B-FA000000.snap
-rw-------. 1 postgres postgres 16914638 Sep 20 15:26 xid-1238444701-lsn-2D2B-FB000000.snap
-rw-------. 1 postgres postgres 16782471 Sep 20 15:26 xid-1238444701-lsn-2D2B-FC000000.snap
-rw-------. 1 postgres postgres 16963667 Sep 20 15:27 xid-1238444701-lsn-2D2B-FD000000.snap
...



2021-09-20 17:11:29 CEST [12687]: [1283-1] user=,db=,client= LOG:  checkpoint starting: time
2021-09-20 17:11:31 CEST [12687]: [1284-1] user=,db=,client= LOG:  checkpoint complete: wrote 13 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=1.713 s, sync=0.001 s, total=1.718 s
; sync files=12, longest=0.001 s, average=0.001 s; distance=29 kB, estimate=352191 kB
2021-09-20 17:12:43 CEST [59986]: [2-1] user=,db=,client= ERROR:  terminating logical replication worker due to timeout
2021-09-20 17:12:43 CEST [12546]: [1068-1] user=,db=,client= LOG:  worker process: logical replication worker for subscription 24215702 (PID 59986) exited with exit code 1
2021-09-20 17:12:43 CEST [39945]: [1-1] user=,db=,client= LOG:  logical replication apply worker for subscription "sub" has started
2021-09-20 17:12:43 CEST [39946]: [1-1] user=repuser,db=db,client=127.0.0.1 LOG:  received replication command: IDENTIFY_SYSTEM

Regards,

Fabrice



On Mon, Sep 20, 2021 at 1:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Sep 20, 2021 at 4:10 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> Hi Amit,
>
> We can replay the problem: we load a table of several Gb in the schema of the publisher, this generates the worker's timeout after one minute from the end of this load. The table on which this load is executed is not replicated.
>
> 2021-09-16 12:06:50 CEST [24881]: [1-1] user=postgres,db=db012a00,client=[local] LOG:  duration: 1281408.171 ms  statement: COPY db.table (col1, col2) FROM stdin;
>
> 2021-09-16 12:07:11 CEST [12161]: [1-1] user=,db=,client= LOG:  automatic analyze of table "db.table " system usage: CPU: user: 4.13 s, system: 0.55 s, elapsed: 9.58 s
>
> 2021-09-16 12:07:50 CEST [3770]: [2-1] user=,db=,client= ERROR:  terminating logical replication worker due to timeout
>
> Before increasing value for wal_sender_timeout and wal_receiver_timeout I thought to further investigate the mechanisms leading to this timeout.
>

The basic problem here seems to be that WAL Sender is not able to send
a keepalive or any other message for the configured
wal_receiver_timeout. I am not sure how that can happen but can you
once try by switching autovacuum = off? I wanted to ensure that
WALSender is not blocked due to the background process autovacuum.

--
With Regards,
Amit Kapila.

Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Mon, Sep 20, 2021 at 9:43 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> By passing the autovacuum parameter to off the problem did not occur right after loading the table as in our previous tests. However, the timeout occurred later. We have seen the accumulation of .snap files for several Gb.
>
> ...
> -rw-------. 1 postgres postgres 16791226 Sep 20 15:26 xid-1238444701-lsn-2D2B-F5000000.snap
> -rw-------. 1 postgres postgres 16973268 Sep 20 15:26 xid-1238444701-lsn-2D2B-F6000000.snap
> -rw-------. 1 postgres postgres 16790984 Sep 20 15:26 xid-1238444701-lsn-2D2B-F7000000.snap
> -rw-------. 1 postgres postgres 16988112 Sep 20 15:26 xid-1238444701-lsn-2D2B-F8000000.snap
> -rw-------. 1 postgres postgres 16864593 Sep 20 15:26 xid-1238444701-lsn-2D2B-F9000000.snap
> -rw-------. 1 postgres postgres 16902167 Sep 20 15:26 xid-1238444701-lsn-2D2B-FA000000.snap
> -rw-------. 1 postgres postgres 16914638 Sep 20 15:26 xid-1238444701-lsn-2D2B-FB000000.snap
> -rw-------. 1 postgres postgres 16782471 Sep 20 15:26 xid-1238444701-lsn-2D2B-FC000000.snap
> -rw-------. 1 postgres postgres 16963667 Sep 20 15:27 xid-1238444701-lsn-2D2B-FD000000.snap
> ...
>

Okay, still not sure why the publisher is not sending keep_alive
messages in between spilling such a big transaction. If you see, we
have logic in WalSndLoop() wherein each time after sending data we
check whether we need to send a keep-alive message via function
WalSndKeepaliveIfNecessary(). I think to debug this problem further
you need to add some logs in function WalSndKeepaliveIfNecessary() to
see why it is not sending keep_alive messages when all these files are
being created.

Did you change the default value of
wal_sender_timeout/wal_receiver_timeout? What is the value of those
variables in your environment? Did you see the message "terminating
walsender process due to replication timeout" in your server logs?

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
If I understand correctly, the instruction for the WAL sender to send a keepalive has not been reached in the for loop; for what reason?
...
/* Check for replication timeout. */
  WalSndCheckTimeOut();

/* Send keepalive if the time has come */
  WalSndKeepaliveIfNecessary();
...

The data load is performed on a table which is not replicated; I do not understand why the whole transaction linked to an insert is copied to snap files given that the table does not take part in the logical replication.
We are going to do a test by changing the parameters wal_sender_timeout/wal_receiver_timeout from 1 minute to 5 minutes. The problem is that these parameters are global and changing them will also impact physical replication.

Concerning the walsender timeout, when the worker is started again after a timeout, it will trigger a new walsender associated with it.

postgres 55680 12546  0 Sep20 ?        00:00:02 postgres: aq: bgworker: logical replication worker for subscription 24651602
postgres 55681 12546  0 Sep20 ?        00:00:00 postgres: aq: wal sender process repuser 127.0.0.1(57930) idle
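The walsender attached to the subscription is also visible on the publisher side, for example with:

SELECT pid, application_name, state, sent_lsn, flush_lsn
FROM pg_stat_replication;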

Kind Regards

Fabrice

On Tue, Sep 21, 2021 at 8:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Sep 20, 2021 at 9:43 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> By passing the autovacuum parameter to off the problem did not occur right after loading the table as in our previous tests. However, the timeout occurred later. We have seen the accumulation of .snap files for several Gb.
>
> ...
> -rw-------. 1 postgres postgres 16791226 Sep 20 15:26 xid-1238444701-lsn-2D2B-F5000000.snap
> -rw-------. 1 postgres postgres 16973268 Sep 20 15:26 xid-1238444701-lsn-2D2B-F6000000.snap
> -rw-------. 1 postgres postgres 16790984 Sep 20 15:26 xid-1238444701-lsn-2D2B-F7000000.snap
> -rw-------. 1 postgres postgres 16988112 Sep 20 15:26 xid-1238444701-lsn-2D2B-F8000000.snap
> -rw-------. 1 postgres postgres 16864593 Sep 20 15:26 xid-1238444701-lsn-2D2B-F9000000.snap
> -rw-------. 1 postgres postgres 16902167 Sep 20 15:26 xid-1238444701-lsn-2D2B-FA000000.snap
> -rw-------. 1 postgres postgres 16914638 Sep 20 15:26 xid-1238444701-lsn-2D2B-FB000000.snap
> -rw-------. 1 postgres postgres 16782471 Sep 20 15:26 xid-1238444701-lsn-2D2B-FC000000.snap
> -rw-------. 1 postgres postgres 16963667 Sep 20 15:27 xid-1238444701-lsn-2D2B-FD000000.snap
> ...
>

Okay, still not sure why the publisher is not sending keep_alive
messages in between spilling such a big transaction. If you see, we
have logic in WalSndLoop() wherein each time after sending data we
check whether we need to send a keep-alive message via function
WalSndKeepaliveIfNecessary(). I think to debug this problem further
you need to add some logs in function WalSndKeepaliveIfNecessary() to
see why it is not sending keep_alive messages when all these files are
being created.

Did you change the default value of
wal_sender_timeout/wal_receiver_timeout? What is the value of those
variables in your environment? Did you see the message "terminating
walsender process due to replication timeout" in your server logs?

--
With Regards,
Amit Kapila.

Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Tue, Sep 21, 2021 at 1:52 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> If I understand, the instruction to send keep alive by the wal sender has not been reached in the for loop, for what reason?
> ...
> /* Check for replication timeout. */
>   WalSndCheckTimeOut();
>
> /* Send keepalive if the time has come */
>   WalSndKeepaliveIfNecessary();
> ...
>

Are you sure that these functions have not been called? Or the case is
that these are called but due to some reason the keep-alive is not
sent? IIUC, these are called after processing each WAL record so not
sure how is it possible in your case that these are not reached?

> The data load is performed on a table which is not replicated, I do not understand why the whole transaction linked to an insert is copied to snap files given that table does not take part of the logical replication.
>

It is because we don't know till the end of the transaction (where we
start sending the data) whether the table will be replicated or not. I
think specifically for this purpose the new 'streaming' feature
introduced in PG-14 will help us to avoid writing data of such tables
to snap/spill files. See 'streaming' option in Create Subscription
docs [1].
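On PG-14, for example, it can be enabled per subscription (the subscription name and connection string below are only placeholders):

ALTER SUBSCRIPTION mysub SET (streaming = on);
-- or at creation time:
-- CREATE SUBSCRIPTION mysub CONNECTION '...' PUBLICATION mypub WITH (streaming = on);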

> We are going to do a test by modifying parameters wal_sender_timeout/wal_receiver_timeout from 1' to 5'. The problem is that these parameters are global and changing them will also impact the physical replication.
>

Do you mean you are planning to change from 1 minute to 5 minutes? I
agree with the global nature of parameters and I think your approach
to finding out the root cause is good here because otherwise, under
some similar or more heavy workload, it might lead to the same
situation.

> Concerning the walsender timeout, when the worker is started again after a timeout, it will trigger a new walsender associated with it.
>

Right, I know that but I was curious to know if the walsender has
exited before walreceiver.

[1] - https://www.postgresql.org/docs/devel/sql-createsubscription.html

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
> IIUC, these are called after processing each WAL record so not sure how is it possible in your case that these are not reached?

I don't know; as you say, to highlight the problem we would have to debug the WalSndKeepaliveIfNecessary function.

> I was curious to know if the walsender has exited before walreceiver

During the last tests we made we didn't observe any timeout of the wal sender process.

> Do you mean you are planning to change from 1 minute to 5 minutes?

We set wal_sender_timeout/wal_receiver_timeout to 5 minutes and launched a new test. The result is surprising and rather positive: there is no timeout any more in the log, and the 20 GB of snap files are removed in less than 5 minutes.
How can that behaviour be explained? Why are the snap files suddenly consumed so quickly?
I chose the values for the wal_sender_timeout/wal_receiver_timeout parameters arbitrarily; are these values appropriate from your point of view?
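For what it is worth, during these tests we follow the slot's progress with a query like:

SELECT slot_name, active, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots;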

Best Regards

Fabrice



On Tue, Sep 21, 2021 at 11:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Sep 21, 2021 at 1:52 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> If I understand, the instruction to send keep alive by the wal sender has not been reached in the for loop, for what reason?
> ...
> * Check for replication timeout. */
>   WalSndCheckTimeOut();
>
> /* Send keepalive if the time has come */
>   WalSndKeepaliveIfNecessary();
> ...
>

Are you sure that these functions have not been called? Or the case is
that these are called but due to some reason the keep-alive is not
sent? IIUC, these are called after processing each WAL record so not
sure how is it possible in your case that these are not reached?

> The data load is performed on a table which is not replicated, I do not understand why the whole transaction linked to an insert is copied to snap files given that table does not take part of the logical replication.
>

It is because we don't know till the end of the transaction (where we
start sending the data) whether the table will be replicated or not. I
think specifically for this purpose the new 'streaming' feature
introduced in PG-14 will help us to avoid writing data of such tables
to snap/spill files. See 'streaming' option in Create Subscription
docs [1].

> We are going to do a test by modifying parameters wal_sender_timeout/wal_receiver_timeout from 1' to 5'. The problem is that these parameters are global and changing them will also impact the physical replication.
>

Do you mean you are planning to change from 1 minute to 5 minutes? I
agree with the global nature of parameters and I think your approach
to finding out the root cause is good here because otherwise, under
some similar or more heavy workload, it might lead to the same
situation.

> Concerning the walsender timeout, when the worker is started again after a timeout, it will trigger a new walsender associated with it.
>

Right, I know that but I was curious to know if the walsender has
exited before walreceiver.

[1] - https://www.postgresql.org/docs/devel/sql-createsubscription.html

--
With Regards,
Amit Kapila.

Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Tue, Sep 21, 2021 at 9:12 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> > IIUC, these are called after processing each WAL record so not
> sure how is it possible in your case that these are not reached?
>
> I don't know, as you say, to highlight the problem we would have to debug the WalSndKeepaliveIfNecessary function
>
> > I was curious to know if the walsender has exited before walreceiver
>
> During the last tests we made we didn't observe any timeout of the wal sender process.
>
> > Do you mean you are planning to change from 1 minute to 5 minutes?
>
> We set wal_sender_timeout/wal_receiver_timeout to 5' and launch new test. The result is surprising and rather positive there is no timeout any more in the log and the 20Gb of snap files are removed in less than 5 minutes.
> How to explain that behaviour, why the snap files are consumed suddenly so quickly.
>

I think it is because we decide that the data in those snap files
doesn't need to be sent at xact end, so we remove them.

> I choose the value arbitrarily for wal_sender_timeout/wal_receiver_timeout parameters, are theses values appropriate from your point of view?
>

It is difficult to say what is the appropriate value for these
parameters unless in some way we debug WalSndKeepaliveIfNecessary() to
find why it didn't send keep alive when it is expected. Would you be
able to make code changes and test or if you want I can make changes
and send the patch if you can test it? If not, is it possible that in
some way you send a reproducible test?

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
If you would like, I can test the patch you send me.

Regards

Fabrice

On Wed, Sep 22, 2021 at 11:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Sep 21, 2021 at 9:12 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> > IIUC, these are called after processing each WAL record so not
> sure how is it possible in your case that these are not reached?
>
> I don't know, as you say, to highlight the problem we would have to debug the WalSndKeepaliveIfNecessary function
>
> > I was curious to know if the walsender has exited before walreceiver
>
> During the last tests we made we didn't observe any timeout of the wal sender process.
>
> > Do you mean you are planning to change from 1 minute to 5 minutes?
>
> We set wal_sender_timeout/wal_receiver_timeout to 5' and launch new test. The result is surprising and rather positive there is no timeout any more in the log and the 20Gb of snap files are removed in less than 5 minutes.
> How to explain that behaviour, why the snap files are consumed suddenly so quickly.
>

I think it is because we decide that the data in those snap files
doesn't need to be sent at xact end, so we remove them.

> I choose the value arbitrarily for wal_sender_timeout/wal_receiver_timeout parameters, are theses values appropriate from your point of view?
>

It is difficult to say what is the appropriate value for these
parameters unless in some way we debug WalSndKeepaliveIfNecessary() to
find why it didn't send keep alive when it is expected. Would you be
able to make code changes and test or if you want I can make changes
and send the patch if you can test it? If not, is it possible that in
some way you send a reproducible test?

--
With Regards,
Amit Kapila.

Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Wed, Sep 22, 2021 at 9:46 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> If you would like I can test the patch you send to me.
>

Okay, please find an attached patch for additional logs. I would like
to see the logs during the time when walsender appears to be writing
to files. We might need to add more logs to find the exact problem but
let's start with this.

-- 
With Regards,
Amit Kapila.

Attachments

Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
Thanks for your patch, we are going to set up a lab in order to debug the function.
Regards
Fabrice

On Thu, Sep 23, 2021 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Sep 22, 2021 at 9:46 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> If you would like I can test the patch you send to me.
>

Okay, please find an attached patch for additional logs. I would like
to see the logs during the time when walsender appears to be writing
to files. We might need to add more logs to find the exact problem but
let's start with this.

--
With Regards,
Amit Kapila.

RE: Logical replication timeout problem

From
Tang, Haiying/唐 海英
Date:

On Friday, September 24, 2021 12:04 AM, Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> Thanks for your patch, we are going to set up a lab in order to debug the function.

Hi,

I tried to reproduce this timeout problem on version 10.18 but failed.
In my trial, I inserted large amounts of data at the publisher, which took more than 1 minute to replicate.
And with the patch provided by Amit, I saw that the frequency of invoking the WalSndKeepaliveIfNecessary function increased after I inserted data.

The test script is attached. Maybe you can try it on your machine and check if this problem could happen.
If I miss something in the script, please let me know.
Of course, it will be better if you can provide your script to reproduce the problem.

Regards,
Tang

Attachments

Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
Thanks Tang for your script. 
Our debugging environment will be ready soon. I will test your script and we will try to reproduce the problem by integrating the patch provided by Amit. As soon as I have results I will let you know.

Regards

Fabrice

On Thu, Sep 30, 2021 at 3:15 AM Tang, Haiying/唐 海英 <tanghy.fnst@fujitsu.com> wrote:

On Friday, September 24, 2021 12:04 AM, Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> Thanks for your patch, we are going to set up a lab in order to debug the function.

Hi

I tried to reproduce this timeout problem on version10.18 but failed.
In my trial, I inserted large amounts of data at publisher, which took more than 1 minute to replicate.
And with the patch provided by Amit, I saw that the frequency of invoking
WalSndKeepaliveIfNecessary function is raised after I inserted data.

The test script is attached. Maybe you can try it on your machine and check if this problem could happen.
If I miss something in the script, please let me know.
Of course, it will be better if you can provide your script to reproduce the problem.

Regards
Tang

Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
Hello,
Our lab is ready now. Amit, I compiled Postgres 10.18 with your patch. Tang, I used your script to configure logical replication between two databases and to generate 10 million entries in an unreplicated foo table. On a standalone instance, no error message appears in the log.
I activated physical replication between two nodes, and I got the following error:

2021-11-10 10:49:12.297 CET [12126] LOG:  attempt to send keep alive message
2021-11-10 10:49:12.297 CET [12126] STATEMENT:  START_REPLICATION 0/3000000 TIMELINE 1
2021-11-10 10:49:15.127 CET [12064] FATAL:  terminating logical replication worker due to administrator command
2021-11-10 10:49:15.127 CET [12036] LOG:  worker process: logical replication worker for subscription 16413 (PID 12064) exited with exit code 1
2021-11-10 10:49:15.155 CET [12126] LOG:  attempt to send keep alive message

This message looks strange because no admin command was executed during the data load.
I did not find any error related to the timeout.
The message coming from the modification made with the patch shows up all the time: "attempt to send keep alive message". But there is never a "sent keep alive message".

Why does the logical replication worker exit when physical replication is configured?

Thanks for your help

Fabrice



On Fri, Oct 8, 2021 at 9:33 AM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
Thanks Tang for your script. 
Our debugging environment will be ready soon. I will test your script and we will try to reproduce the problem by integrating the patch provided by Amit. As soon as I have results I will let you know.

Regards

Fabrice

On Thu, Sep 30, 2021 at 3:15 AM Tang, Haiying/唐 海英 <tanghy.fnst@fujitsu.com> wrote:

On Friday, September 24, 2021 12:04 AM, Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> Thanks for your patch, we are going to set up a lab in order to debug the function.

Hi

I tried to reproduce this timeout problem on version10.18 but failed.
In my trial, I inserted large amounts of data at publisher, which took more than 1 minute to replicate.
And with the patch provided by Amit, I saw that the frequency of invoking
WalSndKeepaliveIfNecessary function is raised after I inserted data.

The test script is attached. Maybe you can try it on your machine and check if this problem could happen.
If I miss something in the script, please let me know.
Of course, it will be better if you can provide your script to reproduce the problem.

Regards
Tang

Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Thu, Nov 11, 2021 at 11:15 PM Fabrice Chapuis
<fabrice636861@gmail.com> wrote:
>
> Hello,
> Our lab is ready now. Amit, I compile Postgres 10.18 with your patch. Tang, I used your script to configure logical replication between 2 databases and to generate 10 million entries in an unreplicated foo table. On a standalone instance no error message appears in log.
> I activate the physical replication between 2 nodes, and I got following error:
>
> 2021-11-10 10:49:12.297 CET [12126] LOG:  attempt to send keep alive message
> 2021-11-10 10:49:12.297 CET [12126] STATEMENT:  START_REPLICATION 0/3000000 TIMELINE 1
> 2021-11-10 10:49:15.127 CET [12064] FATAL:  terminating logical replication worker due to administrator command
> 2021-11-10 10:49:15.127 CET [12036] LOG:  worker process: logical replication worker for subscription 16413 (PID 12064) exited with exit code 1
> 2021-11-10 10:49:15.155 CET [12126] LOG:  attempt to send keep alive message
>
> This message look like strange because no admin command have been executed during data load.
> I did not find any error related to the timeout.
> The message coming from the modification made with the patch comes back all the time: attempt to send keep alive message. But there is no "sent keep alive message".
>
> Why logical replication worker exit when physical replication is configured?
>

I am also not sure why that happened; it may be due to
max_worker_processes reaching its limit. This can happen because it
seems you configured both publisher and subscriber in the same
cluster. Tang, did you also see the same problem?
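As a quick check, you could compare the limit with the number of worker processes actually running, for example:

SHOW max_worker_processes;

SELECT backend_type, count(*)
FROM pg_stat_activity
GROUP BY backend_type;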

BTW, why are you bringing a physical standby configuration into the
test? In your original setup where you observed the problem, were
physical standbys present?

--
With Regards,
Amit Kapila.



RE: Logical replication timeout problem

From
"tanghy.fnst@fujitsu.com"
Date:
On Friday, November 12, 2021 2:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Thu, Nov 11, 2021 at 11:15 PM Fabrice Chapuis
> <fabrice636861@gmail.com> wrote:
> >
> > Hello,
> > Our lab is ready now. Amit,  I compile Postgres 10.18 with your patch.Tang, I
> used your script to configure logical replication between 2 databases and to
> generate 10 million entries in an unreplicated foo table. On a standalone instance
> no error message appears in log.
> > I activate the physical replication between 2 nodes, and I got following error:
> >
> > 2021-11-10 10:49:12.297 CET [12126] LOG:  attempt to send keep alive
> message
> > 2021-11-10 10:49:12.297 CET [12126] STATEMENT:  START_REPLICATION
> 0/3000000 TIMELINE 1
> > 2021-11-10 10:49:15.127 CET [12064] FATAL:  terminating logical replication
> worker due to administrator command
> > 2021-11-10 10:49:15.127 CET [12036] LOG:  worker process: logical replication
> worker for subscription 16413 (PID 12064) exited with exit code 1
> > 2021-11-10 10:49:15.155 CET [12126] LOG:  attempt to send keep alive
> message
> >
> > This message look like strange because no admin command have been executed
> during data load.
> > I did not find any error related to the timeout.
> > The message coming from the modification made with the patch comes back all
> the time: attempt to send keep alive message. But there is no "sent keep alive
> message".
> >
> > Why logical replication worker exit when physical replication is configured?
> >
> 
> I am also not sure why that happened may be due to
> max_worker_processes reaching its limit. This can happen because it
> seems you configured both publisher and subscriber in the same
> cluster. Tang, did you also see the same problem?
> 

No.
I used the default max_worker_processes value and ran logical replication and
physical replication at the same time. I also changed the data in the table on
the publisher, but didn't see the same problem.

Regards
Tang

Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
I made a mistake in the configuration of my test script; in fact I cannot reproduce the problem at the moment.
Yes, in the original environment there is physical replication; that's why for the lab I configured two nodes with physical replication.
I'll try new tests next week.
Regards

On Fri, Nov 12, 2021 at 7:23 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Nov 11, 2021 at 11:15 PM Fabrice Chapuis
<fabrice636861@gmail.com> wrote:
>
> Hello,
> Our lab is ready now. Amit,  I compile Postgres 10.18 with your patch.Tang, I used your script to configure logical replication between 2 databases and to generate 10 million entries in an unreplicated foo table. On a standalone instance no error message appears in log.
> I activate the physical replication between 2 nodes, and I got following error:
>
> 2021-11-10 10:49:12.297 CET [12126] LOG:  attempt to send keep alive message
> 2021-11-10 10:49:12.297 CET [12126] STATEMENT:  START_REPLICATION 0/3000000 TIMELINE 1
> 2021-11-10 10:49:15.127 CET [12064] FATAL:  terminating logical replication worker due to administrator command
> 2021-11-10 10:49:15.127 CET [12036] LOG:  worker process: logical replication worker for subscription 16413 (PID 12064) exited with exit code 1
> 2021-11-10 10:49:15.155 CET [12126] LOG:  attempt to send keep alive message
>
> This message look like strange because no admin command have been executed during data load.
> I did not find any error related to the timeout.
> The message coming from the modification made with the patch comes back all the time: attempt to send keep alive message. But there is no "sent keep alive message".
>
> Why logical replication worker exit when physical replication is configured?
>

I am also not sure why that happened may be due to
max_worker_processes reaching its limit. This can happen because it
seems you configured both publisher and subscriber in the same
cluster. Tang, did you also see the same problem?

BTW, why are you bringing physical standby configuration into the
test? Does in your original setup where you observe the problem the
physical standbys were there?

--
With Regards,
Amit Kapila.

Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Wed, Dec 22, 2021 at 8:50 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> Hello Amit,
>
> I was able to reproduce the timeout problem in the lab.
> After loading more than 20 millions of rows in a table which is not replicated (insert command ends without error), errors related to logical replication processes appear in the postgres log.
> Approximately every 5 minutes worker process is restarted. The snap files in the slot directory are still present. The replication system seems to be blocked. Why these snap files are not removed. What do they contain?
>

These contain changes of insert. I think these are not removed for
your case as your long transaction is never finished. As mentioned
earlier, for such cases, it is better to use 'streaming' feature
released as part of PG-14 but anyway here we are trying to debug
timeout problem.

> I will recompile postgres with your patch to debug.
>

Okay, that might help.

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
I ran the instance with a high debug level.
My attempt at interpreting the log: after the modifications generated by the insert have been written to the snap files, these files are read back (restored). One minute after this work starts, the worker process exits with error code 1.
I see that keepalive messages were sent before the worker process leaves.

2021-12-28 10:50:01.894 CET [55792] LOCATION:  WalSndKeepalive, walsender.c:3365
...
2021-12-28 10:50:31.854 CET [55792] STATEMENT:  START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:31.907 CET [55792] DEBUG:  00000: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2021-12-28 10:50:31.907 CET [55792] LOCATION:  ShowTransactionStateRec, xact.c:5075
2021-12-28 10:50:31.907 CET [55792] STATEMENT:  START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:31.907 CET [55792] DEBUG:  00000: spill 2271 changes in XID 14312 to disk
2021-12-28 10:50:31.907 CET [55792] LOCATION:  ReorderBufferSerializeTXN, reorderbuffer.c:2245
2021-12-28 10:50:31.907 CET [55792] STATEMENT:  START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:32.110 CET [55792] DEBUG:  00000: restored 4096/22603999 changes from disk
2021-12-28 10:50:32.110 CET [55792] LOCATION:  ReorderBufferIterTXNNext, reorderbuffer.c:1156
2021-12-28 10:50:32.110 CET [55792] STATEMENT:  START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:32.138 CET [55792] DEBUG:  00000: restored 4096/22603999 changes from disk
...
2021-12-28 10:50:35.341 CET [55794] DEBUG:  00000: sending replication keepalive
2021-12-28 10:50:35.341 CET [55794] LOCATION:  WalSndKeepalive, walsender.c:3365

...
2021-12-28 10:51:31.995 CET [55791] ERROR:  XX000: terminating logical replication worker due to timeout
2021-12-28 10:51:31.995 CET [55791] LOCATION:  LogicalRepApplyLoop, worker.c:1267

Could this function in the apply main loop in worker.c help to find a solution?

rc = WaitLatchOrSocket(MyLatch,
                       WL_SOCKET_READABLE | WL_LATCH_SET |
                       WL_TIMEOUT | WL_POSTMASTER_DEATH,
                       fd, wait_time,
                       WAIT_EVENT_LOGICAL_APPLY_MAIN);

Thanks for your help

Fabrice

On Thu, Dec 23, 2021 at 11:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Dec 22, 2021 at 8:50 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> Hello Amit,
>
> I was able to reproduce the timeout problem in the lab.
> After loading more than 20 millions of rows in a table which is not replicated (insert command ends without error), errors related to logical replication processes appear in the postgres log.
> Approximately every 5 minutes worker process is restarted. The snap files in the slot directory are still present. The replication system seems to be blocked. Why these snap files are not removed. What do they contain?
>

These contain changes of insert. I think these are not removed for
your case as your long transaction is never finished. As mentioned
earlier, for such cases, it is better to use 'streaming' feature
released as part of PG-14 but anyway here we are trying to debug
timeout problem.

> I will recompile postgres with your patch to debug.
>

Okay, that might help.

--
With Regards,
Amit Kapila.

Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Wed, Dec 29, 2021 at 5:02 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
I put the instance with high level debug mode.
I try to do some log interpretation: After having finished writing the modifications generated by the insert in the snap files,
then these files are read (restored). One minute after this work starts, the worker process exit with an error code = 1.
I see that keepalive messages were sent before the work process work leave.

2021-12-28 10:50:01.894 CET [55792] LOCATION:  WalSndKeepalive, walsender.c:3365
...
2021-12-28 10:50:31.854 CET [55792] STATEMENT:  START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:31.907 CET [55792] DEBUG:  00000: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2021-12-28 10:50:31.907 CET [55792] LOCATION:  ShowTransactionStateRec, xact.c:5075
2021-12-28 10:50:31.907 CET [55792] STATEMENT:  START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:31.907 CET [55792] DEBUG:  00000: spill 2271 changes in XID 14312 to disk
2021-12-28 10:50:31.907 CET [55792] LOCATION:  ReorderBufferSerializeTXN, reorderbuffer.c:2245
2021-12-28 10:50:31.907 CET [55792] STATEMENT:  START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:32.110 CET [55792] DEBUG:  00000: restored 4096/22603999 changes from disk
2021-12-28 10:50:32.110 CET [55792] LOCATION:  ReorderBufferIterTXNNext, reorderbuffer.c:1156
2021-12-28 10:50:32.110 CET [55792] STATEMENT:  START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:32.138 CET [55792] DEBUG:  00000: restored 4096/22603999 changes from disk
...
2021-12-28 10:50:35.341 CET [55794] DEBUG:  00000: sending replication keepalive
2021-12-28 10:50:35.341 CET [55794] LOCATION:  WalSndKeepalive, walsender.c:3365

...
2021-12-28 10:51:31.995 CET [55791] ERROR:  XX000: terminating logical replication worker due to timeout
2021-12-28 10:51:31.995 CET [55791] LOCATION:  LogicalRepApplyLoop, worker.c:1267


It is still not clear to me why the problem happened. IIUC, after restoring 4096 changes from the snap files, we send them to the subscriber, and then the apply worker should apply them one by one. Now, is it taking one minute to restore 4096 changes, due to which the apply worker is timed out?

Could this function in Apply main loop in worker.c help to find a solution?

rc = WaitLatchOrSocket(MyLatch,
WL_SOCKET_READABLE | WL_LATCH_SET |
WL_TIMEOUT | WL_POSTMASTER_DEATH,
fd, wait_time,
WAIT_EVENT_LOGICAL_APPLY_MAIN);


Can you explain why you think this will help in solving your current problem?

--
With Regards,
Amit Kapila.

Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
Can you explain why you think this will help in solving your current problem?

Indeed you are right, this function won't help; we have to look elsewhere.

It is still not clear to me why the problem happened? IIUC, after restoring 4096 changes from snap files, we send them to the subscriber, and then apply worker should apply those one by one. Now, is it taking one minute to restore 4096 changes due to which apply worker is timed out?

Now I can easily reproduce the problem.
In a first phase, snap files are generated and stored in pg_replslot. This process ends when 1420 files are present in pg_replslot (this is related to the statements that must be replayed from the WAL). In the pg_stat_replication view, the state field is set to catchup.
In a second phase, the snap files must be decoded. However, after one minute (wal_receiver_timeout parameter set to 1 minute) the worker process stops with a timeout.

I can put a debug point to check if a timeout is sent to the worker process. Do you have any other clue?
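On the subscriber side, we can also watch how long the apply worker has gone without receiving a message, for example:

SELECT subname, received_lsn, last_msg_send_time, last_msg_receipt_time
FROM pg_stat_subscription;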

Thank you for your help

Fabrice




On Fri, Jan 7, 2022 at 11:26 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Dec 29, 2021 at 5:02 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
I put the instance with high level debug mode.
I try to do some log interpretation: After having finished writing the modifications generated by the insert in the snap files,
then these files are read (restored). One minute after this work starts, the worker process exit with an error code = 1.
I see that keepalive messages were sent before the work process work leave.

2021-12-28 10:50:01.894 CET [55792] LOCATION:  WalSndKeepalive, walsender.c:3365
...
2021-12-28 10:50:31.854 CET [55792] STATEMENT:  START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:31.907 CET [55792] DEBUG:  00000: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2021-12-28 10:50:31.907 CET [55792] LOCATION:  ShowTransactionStateRec, xact.c:5075
2021-12-28 10:50:31.907 CET [55792] STATEMENT:  START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:31.907 CET [55792] DEBUG:  00000: spill 2271 changes in XID 14312 to disk
2021-12-28 10:50:31.907 CET [55792] LOCATION:  ReorderBufferSerializeTXN, reorderbuffer.c:2245
2021-12-28 10:50:31.907 CET [55792] STATEMENT:  START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:32.110 CET [55792] DEBUG:  00000: restored 4096/22603999 changes from disk
2021-12-28 10:50:32.110 CET [55792] LOCATION:  ReorderBufferIterTXNNext, reorderbuffer.c:1156
2021-12-28 10:50:32.110 CET [55792] STATEMENT:  START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:32.138 CET [55792] DEBUG:  00000: restored 4096/22603999 changes from disk
...
2021-12-28 10:50:35.341 CET [55794] DEBUG:  00000: sending replication keepalive
2021-12-28 10:50:35.341 CET [55794] LOCATION:  WalSndKeepalive, walsender.c:3365

...
2021-12-28 10:51:31.995 CET [55791] ERROR:  XX000: terminating logical replication worker due to timeout
2021-12-28 10:51:31.995 CET [55791] LOCATION:  LogicalRepApplyLoop, worker.c:1267


It is still not clear to me why the problem happened? IIUC, after restoring 4096 changes from snap files, we send them to the subscriber, and then apply worker should apply those one by one. Now, is it taking one minute to restore 4096 changes due to which apply worker is timed out?

Could this function in Apply main loop in worker.c help to find a solution?

rc = WaitLatchOrSocket(MyLatch,
WL_SOCKET_READABLE | WL_LATCH_SET |
WL_TIMEOUT | WL_POSTMASTER_DEATH,
fd, wait_time,
WAIT_EVENT_LOGICAL_APPLY_MAIN);


Can you explain why you think this will help in solving your current problem?

--
With Regards,
Amit Kapila.

Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Tue, Jan 11, 2022 at 8:13 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
Can you explain why you think this will help in solving your current problem?

Indeed your are right this function won't help, we have to look elsewhere.

It is still not clear to me why the problem happened? IIUC, after restoring 4096 changes from snap files, we send them to the subscriber, and then apply worker should apply those one by one. Now, is it taking one minute to restore 4096 changes due to which apply worker is timed out?

Now I can easily reproduce the problem.
In a first phase, snap files are generated and stored in pg_replslot. This process end when1420 files are present in pg_replslots (this is in relation with statements that must be replayed from WAL). In the pg_stat_replication view, the state field is set to catchup.
In a 2nd phase, the snap files must be decoded. However after one minute (wal_receiver_timeout parameter set to 1 minute) the worker process stop with a timeout.


What exactly do you mean by the first and second phase in the above description?

--
With Regards,
Amit Kapila.

Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
First phase: Postgres reads the WAL files and generates 1420 snap files.
Second phase: I am guessing here, but maybe you can clarify this point: Postgres has to decode the snap files and remove them if no statement must be applied on a replicated table.
It is from this point that the worker process exits after a 1-minute timeout.

On Wed, Jan 12, 2022 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Jan 11, 2022 at 8:13 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
Can you explain why you think this will help in solving your current problem?

Indeed your are right this function won't help, we have to look elsewhere.

It is still not clear to me why the problem happened? IIUC, after restoring 4096 changes from snap files, we send them to the subscriber, and then apply worker should apply those one by one. Now, is it taking one minute to restore 4096 changes due to which apply worker is timed out?

Now I can easily reproduce the problem.
In a first phase, snap files are generated and stored in pg_replslot. This process end when1420 files are present in pg_replslots (this is in relation with statements that must be replayed from WAL). In the pg_stat_replication view, the state field is set to catchup.
In a 2nd phase, the snap files must be decoded. However after one minute (wal_receiver_timeout parameter set to 1 minute) the worker process stop with a timeout.


What exactly do you mean by the first and second phase in the above description?

--
With Regards,
Amit Kapila.

Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Thu, Jan 13, 2022 at 3:43 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> first phase: postgres read WAL files and generate 1420 snap files.
> second phase: I guess, but on this point maybe you can clarify, postgres has to decode the snap files and remove them if no statement must be applied on a replicated table.
> It is from this point that the worker process exit after 1 minute timeout.
>

Okay, I think the problem could be that because we are skipping all
the changes of transaction there is no communication sent to the
subscriber and it eventually timed out. Actually, we try to send
keep-alive at transaction boundaries like when we call
pgoutput_commit_txn. The pgoutput_commit_txn will call
OutputPluginWrite->WalSndWriteData. I think to tackle the problem we
need to try to send such keepalives via WalSndUpdateProgress and
invoke that in pgoutput_change when we skip sending the change.

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
If I follow you correctly, I have to make the following changes:

1. In walsender.c:

static void
WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
{
    static TimestampTz sendTime = 0;
    TimestampTz now = GetCurrentTimestamp();

    /* Keep the worker process alive */
    WalSndKeepalive(true);

    /*
     * Track lag no more than once per WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS to
     * avoid flooding the lag tracker when we commit frequently.
     */
#define WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS 1000
    if (!TimestampDifferenceExceeds(sendTime, now,
                                    WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS))
        return;

    LagTrackerWrite(lsn, now);
    sendTime = now;
}

I set the requestReply parameter to true; is that correct?

2. In pgoutput.c

/*
 * Sends the decoded DML over wire.
 *
 * This is called both in streaming and non-streaming modes.
 */
static void
pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                Relation relation, ReorderBufferChange *change)
{
    PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
    MemoryContext old;
    RelationSyncEntry *relentry;
    TransactionId xid = InvalidTransactionId;
    Relation ancestor = NULL;

    WalSndUpdateProgress(ctx, txn->origin_lsn, change->txn->xid);

    if (!is_publishable_relation(relation))
        return;
    ...

Make a call to WalSndUpdateProgress in function pgoutput_change.

For info, here is the log after reproducing the problem:

2022-01-13 11:19:46.340 CET [82233] LOCATION:  WalSndKeepaliveIfNecessary, walsender.c:3389
2022-01-13 11:19:46.340 CET [82233] STATEMENT:  START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2022-01-13 11:19:46.340 CET [82233] LOG:  00000: attempt to send keep alive message
2022-01-13 11:19:46.340 CET [82233] LOCATION:  WalSndKeepaliveIfNecessary, walsender.c:3389
2022-01-13 11:19:46.340 CET [82233] STATEMENT:  START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2022-01-13 11:19:46.340 CET [82233] LOG:  00000: attempt to send keep alive message
2022-01-13 11:19:46.340 CET [82233] LOCATION:  WalSndKeepaliveIfNecessary, walsender.c:3389
2022-01-13 11:19:46.340 CET [82233] STATEMENT:  START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2022-01-13 11:20:46.418 CET [82232] ERROR:  XX000: terminating logical replication worker due to timeout
2022-01-13 11:20:46.418 CET [82232] LOCATION:  LogicalRepApplyLoop, worker.c:1267
2022-01-13 11:20:46.421 CET [82224] LOG:  00000: worker process: logical replication worker for subscription 26994 (PID 82232) exited with exit code 1

2022-01-13 11:20:46.421 CET [82224] LOCATION:  LogChildExit, postmaster.c:3625

Thanks a lot for your help.

Fabrice

On Thu, Jan 13, 2022 at 2:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jan 13, 2022 at 3:43 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> first phase: postgres read WAL files and generate 1420 snap files.
> second phase: I guess, but on this point maybe you can clarify, postgres has to decode the snap files and remove them if no statement must be applied on a replicated table.
> It is from this point that the worker process exit after 1 minute timeout.
>

Okay, I think the problem could be that because we are skipping all
the changes of transaction there is no communication sent to the
subscriber and it eventually timed out. Actually, we try to send
keep-alive at transaction boundaries like when we call
pgoutput_commit_txn. The pgoutput_commit_txn will call
OutputPluginWrite->WalSndWriteData. I think to tackle the problem we
need to try to send such keepalives via WalSndUpdateProgress and
invoke that in pgoutput_change when we skip sending the change.

--
With Regards,
Amit Kapila.

Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Fri, Jan 14, 2022 at 3:47 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> If I can follow you, I have to make the following changes:
>

No, not exactly like that, but we can try that way as well to see if it helps
avoid your problem. Am I understanding correctly that even after the
modification you are still seeing the problem? Can you try calling
WalSndKeepaliveIfNecessary() instead of WalSndKeepalive()?

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
If it takes little work for you, could you please send me a piece of code with the change needed to run the test?

Thanks 

Regards,

Fabrice

On Fri, Jan 14, 2022 at 1:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jan 14, 2022 at 3:47 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> If I can follow you, I have to make the following changes:
>

No, not like that but we can try that way as well to see if that helps
to avoid your problem. Am, I understanding correctly even after
modification, you are seeing the problem. Can you try by calling
WalSndKeepaliveIfNecessary() instead of WalSndKeepalive()?

--
With Regards,
Amit Kapila.

Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:

Hello Amit,

If it takes little work for you, could you please send me a piece of code with the change needed to run the test?

Thanks

Regards,

Fabrice


On Fri, Jan 14, 2022 at 1:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jan 14, 2022 at 3:47 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> If I can follow you, I have to make the following changes:
>

No, not like that but we can try that way as well to see if that helps
to avoid your problem. Am, I understanding correctly even after
modification, you are seeing the problem. Can you try by calling
WalSndKeepaliveIfNecessary() instead of WalSndKeepalive()?

--
With Regards,
Amit Kapila.

RE: Logical replication timeout problem

From
"wangw.fnst@fujitsu.com"
Date:

On Wed, Jan 19, 2022 at 9:53 PM Fabrice Chapuis fabrice636861@gmail.com wrote:
> Hello Amit,
> If it takes little work for you, can you please send me a piece of code
> with the change needed to do the test

I wrote a patch (Send-keepalive.patch, please refer to the attachment) following
Amit's suggestions. But after some simple testing of this patch with the test
script "test.sh" (please refer to the attachment), I found that the timeout
problem is not fixed by this patch.

So I added some logs (please refer to Add-some-logs-to-debug.patch) to confirm
whether the newly added WalSndKeepaliveIfNecessary() call sends the keepalive
message or not.

After applying Send-keepalive.patch and Add-some-logs-to-debug.patch, I found
that the added message "send keep alive message" was not printed in the
publisher-side log.

[publisher-side log]:
2022-01-20 15:21:50.057 CST [2400278] LOG:  checkpoint complete: wrote 61 buffers (0.4%); 0 WAL file(s) added, 0 removed, 0 recycled; write=9.838 s, sync=0.720 s, total=10.559 s; sync files=4, longest=0.563 s, average=0.180 s; distance=538053 kB, estimate=543889 kB
2022-01-20 15:21:50.977 CST [2400278] LOG:  checkpoints are occurring too frequently (11 seconds apart)
2022-01-20 15:21:50.977 CST [2400278] HINT:  Consider increasing the configuration parameter "max_wal_size".
2022-01-20 15:21:50.988 CST [2400278] LOG:  checkpoint starting: wal
2022-01-20 15:21:52.853 CST [2400404] LOG:  begin load changes
2022-01-20 15:21:52.853 CST [2400404] STATEMENT:  START_REPLICATION SLOT "sub" LOGICAL 0/0 (proto_version '3', publication_names '"pub"')
2022-01-20 15:22:52.969 CST [2410649] ERROR:  replication slot "sub" is active for PID 2400404
2022-01-20 15:22:52.969 CST [2410649] STATEMENT:  START_REPLICATION SLOT "sub" LOGICAL 0/0 (proto_version '3', publication_names '"pub"')
2022-01-20 15:22:57.980 CST [2410657] ERROR:  replication slot "sub" is active for PID 2400404

[subscriber-side log]:
2022-01-20 15:16:10.975 CST [2400335] LOG:  checkpoint starting: time
2022-01-20 15:16:16.052 CST [2400335] LOG:  checkpoint complete: wrote 51 buffers (0.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=4.830 s, sync=0.135 s, total=5.078 s; sync files=39, longest=0.079 s, average=0.004 s; distance=149 kB, estimate=149 kB
2022-01-20 15:22:52.738 CST [2400400] ERROR:  terminating logical replication worker due to timeout
2022-01-20 15:22:52.738 CST [2400332] LOG:  background worker "logical replication worker" (PID 2400400) exited with exit code 1
2022-01-20 15:22:52.740 CST [2410648] LOG:  logical replication apply worker for subscription "sub" has started
2022-01-20 15:22:52.969 CST [2410648] ERROR:  could not start WAL streaming: ERROR:  replication slot "sub" is active for PID 2400404
2022-01-20 15:22:52.970 CST [2400332] LOG:  background worker "logical replication worker" (PID 2410648) exited with exit code 1
2022-01-20 15:22:57.977 CST [2410656] LOG:  logical replication apply worker for subscription "sub" has started

It seems WalSndKeepaliveIfNecessary did not send a keepalive message during the
test. I am still doing some research into the cause.

I attach the patches and test script mentioned above, in case someone wants to
try them. If I missed something, please let me know.

Regards,
Wang wei

Attachments

Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Thu, Jan 20, 2022 at 2:35 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Wed, Jan 19, 2022 at 9:53 PM Fabrice Chapuis fabrice636861@gmail.com wrote:
>
> > Hello Amit,
>
> > If it takes little work for you, can you please send me a piece of code
>
> > with the change needed to do the test
>
>
>
> I wrote a patch(Send-keepalive.patch, please refer to attachment) according to
>
> Amit's suggestions. But after I did some simple test about this patch by the
>
> test script "test.sh"(please refer to attachment), I found the timeout problem
>
> has not been fixed by this patch.
>
>
>
> So I add some logs(please refer to Add-some-logs-to-debug.patch) to confirm newly
>
> added WalSndKeepaliveIfNecessary() send keepalive message or not.
>
>
>
> After applying the Send-keepalive.patch and Add-some-logs-to-debug.patch, I
>
> found that the added message "send keep alive message" was not printed in
>
> publisher-side log.
>

It might not be reaching the actual send-keepalive logic in
WalSndKeepaliveIfNecessary because of the below code:
{
...
/*
* Don't send keepalive messages if timeouts are globally disabled or
* we're doing something not partaking in timeouts.
*/
if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
return;
..
}

I think you can add elog before the above return and before updating
progress in the below code:
case REORDER_BUFFER_CHANGE_INSERT:
  if (!relentry->pubactions.pubinsert)
+ {
+ OutputPluginUpdateProgress(ctx);
  return;

This will help us to rule out one possibility.

-- 
With Regards,
Amit Kapila.



RE: Logical replication timeout problem

From
"wangw.fnst@fujitsu.com"
Date:
On Thu, Jan 20, 2022 at 9:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> It might be not reaching the actual send_keep_alive logic in
> WalSndKeepaliveIfNecessary because of below code:
> {
> ...
> /*
> * Don't send keepalive messages if timeouts are globally disabled or
> * we're doing something not partaking in timeouts.
> */
> if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0) return; ..
> }
> 
> I think you can add elog before the above return and before updating progress
> in the below code:
> case REORDER_BUFFER_CHANGE_INSERT:
>   if (!relentry->pubactions.pubinsert)
> + {
> + OutputPluginUpdateProgress(ctx);
>   return;
> 
> This will help us to rule out one possibility.

Thanks for your advice!

Following your advice, I applied 0001, 0002 and 0003 and ran the test script.
When the subscriber timed out, I filtered the publisher-side log:
$ grep "before invoking update progress" pub.log | wc -l
60373557
$ grep "return because wal_sender_timeout or last_reply_timestamp" pub.log | wc -l
0
$ grep "return because waiting_for_ping_response" pub.log | wc -l
0

Based on this result, I think function WalSndKeepaliveIfNecessary was invoked,
but function WalSndKeepalive was not invoked, because (last_processing >=
ping_time) was false.
So I looked at how last_processing and last_reply_timestamp change
(ping_time is based on last_reply_timestamp).
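For reference, the gate being discussed looks roughly like this in walsender.c
(paraphrased; the exact code differs slightly between versions):

static void
WalSndKeepaliveIfNecessary(void)
{
	TimestampTz ping_time;

	/*
	 * Don't send keepalive messages if timeouts are globally disabled or
	 * we're doing something not partaking in timeouts.
	 */
	if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
		return;

	if (waiting_for_ping_response)
		return;

	/*
	 * If half of wal_sender_timeout has lapsed without receiving any reply
	 * from the standby, send a keep-alive message requesting an immediate
	 * reply.
	 */
	ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
											wal_sender_timeout / 2);
	if (last_processing >= ping_time)
	{
		WalSndKeepalive(true);

		/* Try to flush pending output to the client */
		if (pq_flush_if_writable() != 0)
			WalSndShutdown();
	}
}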

I found that last_processing and last_reply_timestamp are set in function
ProcessRepliesIfAny.
last_processing is set to the time at which ProcessRepliesIfAny is invoked.
Only when the publisher receives a response from the subscriber is
last_reply_timestamp set to last_processing and the flag
waiting_for_ping_response reset to false.

While we are in the loop that skips all the changes of a transaction, IIUC, we do
not invoke ProcessRepliesIfAny. So I think last_processing and
last_reply_timestamp will not change during this loop.
Therefore, for our use case, I think we should modify the condition for
invoking WalSndKeepalive (please refer to
0004-Simple-modification-of-timing.patch, and note that this is only a patch
for testing).
At the same time I changed the argument of WalSndKeepalive from true to false. This
is because when the argument is true, waiting_for_ping_response is set to true in
WalSndKeepalive. As mentioned above, ProcessRepliesIfAny is not invoked in the
loop, so waiting_for_ping_response would never be reset to false and further
keepalive messages would not be sent.
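In code form, the timing change being described is roughly the following inside
WalSndUpdateProgress (only a sketch; the actual change is in the attached 0004
patch):

/*
 * If half of wal_sender_timeout has elapsed since we last sent anything to
 * the standby, send a keepalive now.  Passing false means no reply is
 * requested, so waiting_for_ping_response cannot get stuck while we keep
 * skipping changes without ever reaching ProcessRepliesIfAny().
 */
static TimestampTz sendTime = 0;
TimestampTz now = GetCurrentTimestamp();
TimestampTz ping_time = TimestampTzPlusMilliseconds(sendTime,
													wal_sender_timeout / 2);

if (now >= ping_time)
{
	WalSndKeepalive(false);

	/* Try to flush pending output to the client */
	if (pq_flush_if_writable() != 0)
		WalSndShutdown();
	sendTime = now;
}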

I tested after applying patches 0001 and 0004, and found that the timeout was no
longer reported in the subscriber-side log, while the added messages "begin load changes" and
"commit the log" were printed in the publisher-side log:
$ grep -ir "begin load changes" pub.log
2022-01-21 11:17:06.934 CST [2577699] LOG:  begin load changes
$ grep -ir "commit the log" pub.log
2022-01-21 11:21:15.564 CST [2577699] LOG:  commit the log

Attach the patches and test script mentioned above, in case someone wants to
try.

Regards,
Wang wei

Attachments

Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
Thanks for your patch; it works well for our use case too, and the timeout no longer appears in the logs. Is the next step to refine this patch, keeping the changes as small as possible, so that it can be released?

On Fri, Jan 21, 2022 at 10:51 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
On Thu, Jan 20, 2022 at 9:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> It might be not reaching the actual send_keep_alive logic in
> WalSndKeepaliveIfNecessary because of below code:
> {
> ...
> /*
> * Don't send keepalive messages if timeouts are globally disabled or
> * we're doing something not partaking in timeouts.
> */
> if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0) return; ..
> }
>
> I think you can add elog before the above return and before updating progress
> in the below code:
> case REORDER_BUFFER_CHANGE_INSERT:
>   if (!relentry->pubactions.pubinsert)
> + {
> + OutputPluginUpdateProgress(ctx);
>   return;
>
> This will help us to rule out one possibility.

Thanks for your advices!

According to your advices, I applied 0001,0002 and 0003 to run the test script.
When subscriber timeout, I filter publisher-side log:
$ grep "before invoking update progress" pub.log | wc -l
60373557
$ grep "return because wal_sender_timeout or last_reply_timestamp" pub.log | wc -l
0
$ grep "return because waiting_for_ping_response" pub.log | wc -l
0

Based on this result, I think function WalSndKeepaliveIfNecessary was invoked,
but function WalSndKeepalive was not invoked because (last_processing >=
ping_time) is false.
So I tried to see changes about last_processing and last_reply_timestamp
(because ping_time is based on last_reply_timestamp).

I found last_processing and last_reply_timestamp is set in function
ProcessRepliesIfAny.
last_processing is set to the time when function ProcessRepliesIfAny is
invoked.
Only when publisher receive a response from subscriber, last_reply_timestamp is
set to last_processing and the flag waiting_for_ping_response is reset to
false.

When we are during the loop to skip all the changes of transaction, IIUC, we do
not invoke function ProcessRepliesIfAny. So I think last_processing and
last_reply_timestamp will not be changed in this loop.
Therefore I think about our use case, we should modify the condition of
invoking WalSndKeepalive.(please refer to
0004-Simple-modification-of-timing.patch, and note that this is only a patch
for testing).
At the same time I modify the input of WalSndKeepalive from true to false. This
is because when input is true, waiting_for_ping_response is set to true in
WalSndKeepalive. As mentioned above, ProcessRepliesIfAny is not invoked in the
loop, so I think waiting_for_ping_response will not be reset to false and
keepalive messages will not be sent.

I tested after applying patches(0001 and 0004), I found the timeout was not
printed in subscriber-side log. And the added messages "begin load changes" and
"commit the log" were printed in publisher-side log:
$ grep -ir "begin load changes" pub.log
2022-01-21 11:17:06.934 CST [2577699] LOG:  begin load changes
$ grep -ir "commit the log" pub.log
2022-01-21 11:21:15.564 CST [2577699] LOG:  commit the log

Attach the patches and test script mentioned above, in case someone wants to
try.

Regards,
Wang wei

Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
I kept your patch 0001 and added these two calls in function WalSndUpdateProgress, without modifying WalSndKeepaliveIfNecessary; it works too.
What do you think of this patch?

static void
WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
{
        static TimestampTz sendTime = 0;
        TimestampTz now = GetCurrentTimestamp();

        ProcessRepliesIfAny();
        WalSndKeepaliveIfNecessary();



        /*
         * Track lag no more than once per WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS to
         * avoid flooding the lag tracker when we commit frequently.
         */
...
Regards

Fabrice

On Fri, Jan 21, 2022 at 2:17 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
Thanks for your patch, it also works well when executing our use case, the timeout no longer appears in the logs. Is it necessary now to refine this patch and make as few changes as possible in order for it to be released?

On Fri, Jan 21, 2022 at 10:51 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
On Thu, Jan 20, 2022 at 9:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> It might be not reaching the actual send_keep_alive logic in
> WalSndKeepaliveIfNecessary because of below code:
> {
> ...
> /*
> * Don't send keepalive messages if timeouts are globally disabled or
> * we're doing something not partaking in timeouts.
> */
> if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0) return; ..
> }
>
> I think you can add elog before the above return and before updating progress
> in the below code:
> case REORDER_BUFFER_CHANGE_INSERT:
>   if (!relentry->pubactions.pubinsert)
> + {
> + OutputPluginUpdateProgress(ctx);
>   return;
>
> This will help us to rule out one possibility.

Thanks for your advices!

According to your advices, I applied 0001,0002 and 0003 to run the test script.
When subscriber timeout, I filter publisher-side log:
$ grep "before invoking update progress" pub.log | wc -l
60373557
$ grep "return because wal_sender_timeout or last_reply_timestamp" pub.log | wc -l
0
$ grep "return because waiting_for_ping_response" pub.log | wc -l
0

Based on this result, I think function WalSndKeepaliveIfNecessary was invoked,
but function WalSndKeepalive was not invoked because (last_processing >=
ping_time) is false.
So I tried to see changes about last_processing and last_reply_timestamp
(because ping_time is based on last_reply_timestamp).

I found last_processing and last_reply_timestamp is set in function
ProcessRepliesIfAny.
last_processing is set to the time when function ProcessRepliesIfAny is
invoked.
Only when publisher receive a response from subscriber, last_reply_timestamp is
set to last_processing and the flag waiting_for_ping_response is reset to
false.

When we are during the loop to skip all the changes of transaction, IIUC, we do
not invoke function ProcessRepliesIfAny. So I think last_processing and
last_reply_timestamp will not be changed in this loop.
Therefore I think about our use case, we should modify the condition of
invoking WalSndKeepalive.(please refer to
0004-Simple-modification-of-timing.patch, and note that this is only a patch
for testing).
At the same time I modify the input of WalSndKeepalive from true to false. This
is because when input is true, waiting_for_ping_response is set to true in
WalSndKeepalive. As mentioned above, ProcessRepliesIfAny is not invoked in the
loop, so I think waiting_for_ping_response will not be reset to false and
keepalive messages will not be sent.

I tested after applying patches(0001 and 0004), I found the timeout was not
printed in subscriber-side log. And the added messages "begin load changes" and
"commit the log" were printed in publisher-side log:
$ grep -ir "begin load changes" pub.log
2022-01-21 11:17:06.934 CST [2577699] LOG:  begin load changes
$ grep -ir "commit the log" pub.log
2022-01-21 11:21:15.564 CST [2577699] LOG:  commit the log

Attach the patches and test script mentioned above, in case someone wants to
try.

Regards,
Wang wei

Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Fri, Jan 21, 2022 at 10:45 PM Fabrice Chapuis
<fabrice636861@gmail.com> wrote:
>
> I keep your patch 0001 and I add these two calls in function WalSndUpdateProgress without modifying
> WalSndKeepaliveIfNecessary, it works too.
> What do you think of this patch?
>

I think this will also work. Here, the point was to just check what is
the exact problem and the possible approach to solve it, the actual
patch might be different from these ideas. So, let me try to summarize
the problem and the possible approach to solve it so that others can
also share their opinion.

Here, the problem is that we don't send keep-alive messages for a long
time while processing large transactions during logical replication
where we don't send any data of such transactions (say because the
table modified in the transaction is not published). We do try to send
the keep_alive if necessary at the end of the transaction (via
WalSndWriteData()) but by that time the subscriber-side can timeout
and exit.

Now, one idea to solve this problem could be that whenever we skip
sending any change we do try to update the plugin progress via
OutputPluginUpdateProgress(for walsender, it will invoke
WalSndUpdateProgress), and there it tries to process replies and send
keep_alive if necessary as we do when we send some data via
OutputPluginWrite(for walsender, it will invoke WalSndWriteData). I
don't know whether it is a good idea to invoke such a mechanism for
every change we skip sending or only after we have skipped some
threshold of consecutive changes. I think the latter would be
preferred. Also, we might want to introduce a new parameter
send_keep_alive to this API so that there is flexibility to invoke
this mechanism as we don't need to invoke it while we are actually
sending data and before that, we just update the progress via this
API.
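
Concretely, the shape could be something like the following (the parameter name
and the threshold are placeholders here, not a finished design):

/* logical.c: let callers indicate that a keepalive may be due */
void
OutputPluginUpdateProgress(struct LogicalDecodingContext *ctx,
						   bool send_keep_alive)
{
	if (!ctx->update_progress)
		return;

	/* the update_progress callback would gain the same flag */
	ctx->update_progress(ctx, ctx->write_location, ctx->write_xid,
						 send_keep_alive);
}

/* pgoutput_change(): only ask for a keepalive after many skipped changes */
if (!relentry->pubactions.pubinsert)
{
	if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD)
	{
		OutputPluginUpdateProgress(ctx, true);
		skipped_changes_count = 0;
	}
	return;
}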

Thoughts?

Note: I have added Simon and Petr J. to this thread as they introduced
the API OutputPluginUpdateProgress in commit 024711bb54 and know this
part of code/design well but ideas suggestions from everyone are
welcome.

-- 
With Regards,
Amit Kapila.



RE: Logical replication timeout problem

From
"wangw.fnst@fujitsu.com"
Date:
On Thu, Jan 22, 2022 at 7:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Now, one idea to solve this problem could be that whenever we skip
> sending any change we do try to update the plugin progress via
> OutputPluginUpdateProgress(for walsender, it will invoke
> WalSndUpdateProgress), and there it tries to process replies and send
> keep_alive if necessary as we do when we send some data via
> OutputPluginWrite(for walsender, it will invoke WalSndWriteData). I
> don't know whether it is a good idea to invoke such a mechanism for
> every change we skip to send or we should do it after we skip sending
> some threshold of continuous changes. I think later would be
> preferred. Also, we might want to introduce a new parameter
> send_keep_alive to this API so that there is flexibility to invoke
> this mechanism as we don't need to invoke it while we are actually
> sending data and before that, we just update the progress via this
> API.

I tried out the patch according to your advice.
I found that if I invoke ProcessRepliesIfAny and WalSndKeepaliveIfNecessary in
function OutputPluginUpdateProgress, the newly added
OutputPluginUpdateProgress call in pgoutput_change brings notable
overhead:
--11.34%--pgoutput_change
          |          
          |--8.94%--OutputPluginUpdateProgress
          |          |          
          |           --8.70%--WalSndUpdateProgress
          |                     |          
          |                     |--7.44%--ProcessRepliesIfAny

So I tried another way: sending the keepalive message to the standby based on
the timeout, without asking for a reply (see attachment). With that, the newly
added OutputPluginUpdateProgress call in pgoutput_change brings only slight
overhead:
--3.63%--pgoutput_change
          |          
          |--1.40%--get_rel_sync_entry
          |          |          
          |           --1.14%--hash_search
          |          
           --1.08%--OutputPluginUpdateProgress
                     |          
                      --0.85%--WalSndUpdateProgress

Based on the above, I think the second idea, sending only after some threshold
of consecutively skipped changes, might be better; I will do some research on
this approach.

Regards,
Wang wei

Attachments

Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
Thanks for your new fix Wang.

TimestampTz ping_time = TimestampTzPlusMilliseconds(sendTime, wal_sender_timeout / 2);

shouldn't we use wal_receiver_timeout in place of wal_sender_timeout, because the problem comes from the consumer?

On Wed, Jan 26, 2022 at 4:37 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
On Thu, Jan 22, 2022 at 7:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Now, one idea to solve this problem could be that whenever we skip
> sending any change we do try to update the plugin progress via
> OutputPluginUpdateProgress(for walsender, it will invoke
> WalSndUpdateProgress), and there it tries to process replies and send
> keep_alive if necessary as we do when we send some data via
> OutputPluginWrite(for walsender, it will invoke WalSndWriteData). I
> don't know whether it is a good idea to invoke such a mechanism for
> every change we skip to send or we should do it after we skip sending
> some threshold of continuous changes. I think later would be
> preferred. Also, we might want to introduce a new parameter
> send_keep_alive to this API so that there is flexibility to invoke
> this mechanism as we don't need to invoke it while we are actually
> sending data and before that, we just update the progress via this
> API.

I tried out the patch according to your advice.
I found if I invoke ProcessRepliesIfAny and WalSndKeepaliveIfNecessary in
function OutputPluginUpdateProgress, the running time of the newly added
function OutputPluginUpdateProgress invoked in pgoutput_change brings notable
overhead:
--11.34%--pgoutput_change
          |         
          |--8.94%--OutputPluginUpdateProgress
          |          |         
          |           --8.70%--WalSndUpdateProgress
          |                     |         
          |                     |--7.44%--ProcessRepliesIfAny

So I tried another way of sending keepalive message to the standby machine
based on the timeout without asking for a reply(see attachment), the running
time of the newly added function OutputPluginUpdateProgress invoked in
pgoutput_change also brings slight overhead:
--3.63%--pgoutput_change
          |         
          |--1.40%--get_rel_sync_entry
          |          |         
          |           --1.14%--hash_search
          |         
           --1.08%--OutputPluginUpdateProgress
                     |         
                      --0.85%--WalSndUpdateProgress

Based on above, I think the second idea that sending some threshold of
continuous changes might be better, I will do some research about this
approach.

Regards,
Wang wei

RE: Logical replication timeout problem

From
"wangw.fnst@fujitsu.com"
Date:
On Sat, Jan 28, 2022 at 19:36 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
> shouldn't we use receiver_timeout in place of wal_sender_timeout because de
> problem comes from the consummer.
Thanks for your review.

IMO, since this is a bug fix on the publisher side, and the keepalive message
is already sent based on wal_sender_timeout in the existing code, we should
keep it consistent with the existing code.

Regards,
Wang wei

RE: Logical replication timeout problem

From
"wangw.fnst@fujitsu.com"
Date:
On Wed, Jan 26, 2022 at 11:37 AM I wrote:
> On Sat, Jan 22, 2022 at 7:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Now, one idea to solve this problem could be that whenever we skip
> > sending any change we do try to update the plugin progress via
> > OutputPluginUpdateProgress(for walsender, it will invoke
> > WalSndUpdateProgress), and there it tries to process replies and send
> > keep_alive if necessary as we do when we send some data via
> > OutputPluginWrite(for walsender, it will invoke WalSndWriteData). I
> > don't know whether it is a good idea to invoke such a mechanism for
> > every change we skip to send or we should do it after we skip sending
> > some threshold of continuous changes. I think later would be
> > preferred. Also, we might want to introduce a new parameter
> > send_keep_alive to this API so that there is flexibility to invoke
> > this mechanism as we don't need to invoke it while we are actually
> > sending data and before that, we just update the progress via this
> > API.
> ......
> Based on above, I think the second idea that sending some threshold of
> continuous changes might be better, I will do some research about this
> approach.
Based on the second idea, I wrote a new patch(see attachment).

Regards,
Wang wei

Attachments

RE: Logical replication timeout problem

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Wang,

Thank you for making a patch.
I applied your patch and confirmed that the code passes the regression tests.
Here is a short review:

```
+    static int skipped_changes_count = 0;
+    /*
+     * Conservatively, at least 150,000 changes can be skipped in 1s.
+     *
+     * Because we use half of wal_sender_timeout as the threshold, and the unit
+     * of wal_sender_timeout in process is ms, the final threshold is
+     * wal_sender_timeout * 75.
+     */
+    int skipped_changes_threshold = wal_sender_timeout * 75;
```

I'm not sure but could you tell me the background of this calculation? 
Is this assumption reasonable?

```
@@ -654,20 +663,62 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
     {
         case REORDER_BUFFER_CHANGE_INSERT:
             if (!relentry->pubactions.pubinsert)
+            {
+                if (++skipped_changes_count >= skipped_changes_threshold)
+                {
+                    OutputPluginUpdateProgress(ctx, true);
+
+                    /*
+                     * After sending keepalive message, reset
+                     * skipped_changes_count.
+                     */
+                    skipped_changes_count = 0;
+                }
                 return;
+            }
             break;
```

Is the if-statement needed? In the walsender case OutputPluginUpdateProgress() leads to WalSndUpdateProgress(),
and that function also has its own threshold for pinging.

```
static void
-WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
+WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool send_keep_alive)
 {
-    static TimestampTz sendTime = 0;
+    static TimestampTz trackTime = 0;
     TimestampTz now = GetCurrentTimestamp();
 
+    if (send_keep_alive)
+    {
+        /*
+         * If half of wal_sender_timeout has lapsed without send message standby,
+         * send a keep-alive message to the standby.
+         */
+        static TimestampTz sendTime = 0;
+        TimestampTz ping_time = TimestampTzPlusMilliseconds(sendTime,
+                                            wal_sender_timeout / 2);
+        if (now >= ping_time)
+        {
+            WalSndKeepalive(false);
+
+            /* Try to flush pending output to the client */
+            if (pq_flush_if_writable() != 0)
+                WalSndShutdown();
+            sendTime = now;
+        }
+    }
+
```

* +1 about renaming to trackTime.
* `/2` might be a magic number. How about the following? Renaming is very welcome:

```
+#define WALSND_LOGICAL_PING_FACTOR     0.5
+               static TimestampTz sendTime = 0;
+               TimestampTz ping_time = TimestampTzPlusMilliseconds(sendTime,
+                                              wal_sender_timeout * WALSND_LOGICAL_PING_FACTOR)
```

Could you add a commitfest entry for cfbot?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
Thanks for your patch, it works well in my test lab.
I added an extern declaration for wal_sender_timeout in the output_plugin.h file so that it compiles.
I tested the patch against version 10, which is currently in production on our systems.
The functions below exist only in the master branch:
pgoutput_prepare_txn,
pgoutput_commit_prepared_txn,
pgoutput_rollback_prepared_txn,
pgoutput_stream_commit,
pgoutput_stream_prepare_txn

Will the patch be back-patched to versions 13, 12, 11 and 10?

Best regards,

Fabrice

On Tue, Feb 8, 2022 at 3:59 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
On Wed, Jan 26, 2022 at 11:37 AM I wrote:
> On Sat, Jan 22, 2022 at 7:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Now, one idea to solve this problem could be that whenever we skip
> > sending any change we do try to update the plugin progress via
> > OutputPluginUpdateProgress(for walsender, it will invoke
> > WalSndUpdateProgress), and there it tries to process replies and send
> > keep_alive if necessary as we do when we send some data via
> > OutputPluginWrite(for walsender, it will invoke WalSndWriteData). I
> > don't know whether it is a good idea to invoke such a mechanism for
> > every change we skip to send or we should do it after we skip sending
> > some threshold of continuous changes. I think later would be
> > preferred. Also, we might want to introduce a new parameter
> > send_keep_alive to this API so that there is flexibility to invoke
> > this mechanism as we don't need to invoke it while we are actually
> > sending data and before that, we just update the progress via this
> > API.
> ......
> Based on above, I think the second idea that sending some threshold of
> continuous changes might be better, I will do some research about this
> approach.
Based on the second idea, I wrote a new patch(see attachment).

Regards,
Wang wei

RE: Logical replication timeout problem

From
"wangw.fnst@fujitsu.com"
Date:
On Tues, Feb 08, 2022 at 17:18 PM Kuroda, Hayato <kuroda.hayato@fujitsu.com> wrote:
> I applied your patch and confirmed that codes passed regression test.
> I put a short reviewing:
Thanks for your test and review.

> ```
> +    static int skipped_changes_count = 0;
> +    /*
> +     * Conservatively, at least 150,000 changes can be skipped in 1s.
> +     *
> +     * Because we use half of wal_sender_timeout as the threshold, and
> the unit
> +     * of wal_sender_timeout in process is ms, the final threshold is
> +     * wal_sender_timeout * 75.
> +     */
> +    int skipped_changes_threshold = wal_sender_timeout * 75;
> ```
> 
> I'm not sure but could you tell me the background of this calculation?
> Is this assumption reasonable?
According to our discussion, we need to send keepalive messages to the subscriber
when skipping changes.
One approach is that **for each skipped change** we try to send a keepalive
message, calculating whether a timeout would occur based on the current time
and the last time a keepalive was sent. But this brings slight overhead.
So I want to try another approach: only after **continuously skipping some changes**
do we try to send a keepalive message, again calculating whether a timeout would
occur based on the current time and the last time a keepalive was sent.

IMO, we should send a keepalive message only after skipping a certain number of
changes in a row.
And I want to calculate the threshold dynamically from a fixed constant, to
avoid adding too much code.
In addition, different users have machines with different performance, and users can
modify wal_sender_timeout, so the threshold should be calculated dynamically
from wal_sender_timeout.

Based on this, I tested on machines with different configurations, and took
the results from the machine with the lowest specification.
[results]
The number of changes that can be skipped per second : 537087 (Average)
To be safe, I set the value to 150000.
(wal_sender_timeout / 2 / 1000 * 150000 = wal_sender_timeout * 75)
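(As a worked example: with the default wal_sender_timeout of 60000 ms, the
threshold comes out to 60000 / 2 / 1000 * 150000 = 30 * 150000 = 4,500,000
skipped changes, i.e. 60000 * 75.)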

The spec of the test server to get the threshold is:
CPU information : Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
Memory information : 816188 kB

> ```
> @@ -654,20 +663,62 @@ pgoutput_change(LogicalDecodingContext *ctx,
> ReorderBufferTXN *txn,
>      {
>          case REORDER_BUFFER_CHANGE_INSERT:
>              if (!relentry->pubactions.pubinsert)
> +            {
> +                if (++skipped_changes_count >=
> skipped_changes_threshold)
> +                {
> +                    OutputPluginUpdateProgress(ctx, true);
> +
> +                    /*
> +                     * After sending keepalive message,
> reset
> +                     * skipped_changes_count.
> +                     */
> +                    skipped_changes_count = 0;
> +                }
>                  return;
> +            }
>              break;
> ```
> 
> Is the if-statement needed? In the walsender case
> OutputPluginUpdateProgress() leads WalSndUpdateProgress(), and the
> function also has the threshold for ping-ing.
As mentioned above, we need to skip some changes continuously before
calculating whether it will time out.
If there is no if-statement here, every time a change is skipped, the timeout
will be checked. This brings extra overhead.

> ```
> static void
> -WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn,
> TransactionId xid)
> +WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn,
> +TransactionId xid, bool send_keep_alive)
>  {
> -    static TimestampTz sendTime = 0;
> +    static TimestampTz trackTime = 0;
>      TimestampTz now = GetCurrentTimestamp();
> 
> +    if (send_keep_alive)
> +    {
> +        /*
> +         * If half of wal_sender_timeout has lapsed without send
> message standby,
> +         * send a keep-alive message to the standby.
> +         */
> +        static TimestampTz sendTime = 0;
> +        TimestampTz ping_time =
> TimestampTzPlusMilliseconds(sendTime,
> +
>     wal_sender_timeout / 2);
> +        if (now >= ping_time)
> +        {
> +            WalSndKeepalive(false);
> +
> +            /* Try to flush pending output to the client */
> +            if (pq_flush_if_writable() != 0)
> +                WalSndShutdown();
> +            sendTime = now;
> +        }
> +    }
> +
> ```
> 
> * +1 about renaming to trackTime.
> * `/2` might be magic number. How about following? Renaming is very welcome:
> 
> ```
> +#define WALSND_LOGICAL_PING_FACTOR     0.5
> +               static TimestampTz sendTime = 0;
> +               TimestampTz ping_time = TimestampTzPlusMilliseconds(sendTime,
> +
> +wal_sender_timeout * WALSND_LOGICAL_PING_FACTOR)
> ```
In the existing code, similar operations on wal_sender_timeout use the style of
(wal_sender_timeout / 2), e.g. function WalSndKeepaliveIfNecessary. So I think
this patch should stay consistent with that.
But I also agree that avoiding the magic number would be better; maybe we could
improve that in a new thread.

> Could you add a commitfest entry for cfbot?
Thanks for the reminder, I will add it soon.


Regards,
Wang wei

Re: Logical replication timeout problem

From
Ajin Cherian
Date:
On Tue, Feb 8, 2022 at 1:59 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Wed, Jan 26, 2022 at 11:37 AM I wrote:
> > On Sat, Jan 22, 2022 at 7:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > Now, one idea to solve this problem could be that whenever we skip
> > > sending any change we do try to update the plugin progress via
> > > OutputPluginUpdateProgress(for walsender, it will invoke
> > > WalSndUpdateProgress), and there it tries to process replies and send
> > > keep_alive if necessary as we do when we send some data via
> > > OutputPluginWrite(for walsender, it will invoke WalSndWriteData). I
> > > don't know whether it is a good idea to invoke such a mechanism for
> > > every change we skip to send or we should do it after we skip sending
> > > some threshold of continuous changes. I think later would be
> > > preferred. Also, we might want to introduce a new parameter
> > > send_keep_alive to this API so that there is flexibility to invoke
> > > this mechanism as we don't need to invoke it while we are actually
> > > sending data and before that, we just update the progress via this
> > > API.
> > ......
> > Based on above, I think the second idea that sending some threshold of
> > continuous changes might be better, I will do some research about this
> > approach.
> Based on the second idea, I wrote a new patch(see attachment).

Hi Wang,

Some comments:
I see you only track skipped inserts, updates and deletes. What about
DDL operations that are skipped, and what about truncate?
What about changes made to unpublished tables? I wonder: if you
created a test script that only did DDL operations
and truncates, would this timeout happen?

regards,
Ajin Cherian
Fujitsu Australia



RE: Logical replication timeout problem

From
"wangw.fnst@fujitsu.com"
Date:
On Fri, Feb 18, 2022 at 10:51 AM Ajin Cherian <itsajin@gmail.com> wrote:
> Some comments:
Thanks for your review.

>  I see you only track skipped Inserts/Updates and Deletes. What about
> DDL operations that are skipped, what about truncate.
> What about changes made to unpublished tables? I wonder if you could
> create a test script that only did DDL operations
> and truncates, would this timeout happen?
According to your suggestion, I tested with DDL and truncate.
While testing, I ran only 20,000 DDLs and 10,000 truncations in one
transaction.
If I set wal_sender_timeout and wal_receiver_timeout to 30s, it will time out.
And if I use the default values, it will not time out.
IMHO there should not be long transactions that only contain DDL and
truncation. I'm not quite sure, do we need to handle this kind of use case?

Attach the test details.
[publisher-side]
configure:
    wal_sender_timeout = 30s or 60s
    wal_receiver_timeout = 30s or 60s
sql:
    create table tbl (a int primary key, b text);
    create table tbl2 (a int primary key, b text);
    create publication pub for table tbl;

[subscriber-side]
configure:
    wal_sender_timeout = 30s or 60s
    wal_receiver_timeout = 30s or 60s
sql:
    create table tbl (a int primary key, b text);
    create subscription sub connection 'dbname=postgres user=postgres' publication pub;

[Execute sql in publisher-side]
In a transaction, execute the following SQL 10,000 times in a loop:
    alter table tbl2 rename column b to c;
    truncate table tbl2;
    alter table tbl2 rename column c to b;


Regards,
Wang wei

Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Tue, Feb 22, 2022 at 9:17 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Fri, Feb 18, 2022 at 10:51 AM Ajin Cherian <itsajin@gmail.com> wrote:
> > Some comments:
> Thanks for your review.
>
> >  I see you only track skipped Inserts/Updates and Deletes. What about
> > DDL operations that are skipped, what about truncate.
> > What about changes made to unpublished tables? I wonder if you could
> > create a test script that only did DDL operations
> > and truncates, would this timeout happen?
> According to your suggestion, I tested with DDL and truncate.
> While testing, I ran only 20,000 DDLs and 10,000 truncations in one
> transaction.
> If I set wal_sender_timeout and wal_receiver_timeout to 30s, it will time out.
> And if I use the default values, it will not time out.
> IMHO there should not be long transactions that only contain DDL and
> truncation. I'm not quite sure, do we need to handle this kind of use case?
>

I think it is better to handle such cases as well and changes related
to unpublished tables as well. BTW, it seems Kuroda-San has also given
some comments [1] which I am not sure are addressed.

I think instead of keeping the skipping threshold w.r.t
wal_sender_timeout, we can use some conservative number like 10000 or
so which we are sure won't impact performance and won't lead to
timeouts.

*
+ /*
+ * skipped_changes_count is reset when processing changes that do not need to
+ * be skipped.
+ */
+ skipped_changes_count = 0

When the skipped_changes_count is reset, the sendTime should also be
reset. Can we reset it whenever the UpdateProgress function is called
with send_keep_alive as false?

[1] -
https://www.postgresql.org/message-id/TYAPR01MB5866BD2248EF82FF432FE599F52D9%40TYAPR01MB5866.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.



RE: Logical replication timeout problem

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Wang,

Thank you for teaching some backgrounds about the patch.

> According to our discussion, we need to send keepalive messages to subscriber
> when skipping changes.
> One approach is that **for each skipped change**, we try to send keepalive
> message by calculating whether a timeout will occur based on the current time
> and the last time the keepalive was sent. But this will brings slight overhead.
> So I want to try another approach: after **constantly skipping some changes**,
> we try to send keepalive message by calculating whether a timeout will occur
> based on the current time and the last time the keepalive was sent.

You mean that calls to system functions like GetCurrentTimestamp() should be reduced,
right? I'm not sure how much it affects things, but it seems reasonable.

> IMO, we should send keepalive message after skipping a certain number of
> changes constantly.
> And I want to calculate the threshold dynamically by using a fixed value to
> avoid adding too much code.
> In addition, different users have different machine performance, and users can
> modify wal_sender_timeout, so the threshold should be dynamically calculated
> according to wal_sender_timeout.

Your experiment seems fine, but the background cannot be understood from the
code comments. I prefer a static threshold because it is simpler,
as Amit also said in the following:

https://www.postgresql.org/message-id/CAA4eK1%2B-p_K_j%3DNiGGD6tCYXiJH0ypT4REX5PBKJ4AcUoF3gZQ%40mail.gmail.com

> In the existing code, similar operations on wal_sender_timeout use the style of
> (wal_sender_timeout / 2), e.g. function WalSndKeepaliveIfNecessary. So I think
> it should be consistent in this patch.
> But I think it is better to use magic number too. Maybe we could improve it in
> a new thread.

I confirmed the code and +1 yours. We should treat it in another thread if needed.

BTW, this patch cannot be applied to current master.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Logical replication timeout problem

From
"wangw.fnst@fujitsu.com"
Date:
On Wed, Feb 22, 2022 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
Thanks for your review.

> On Tue, Feb 22, 2022 at 9:17 AM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > On Fri, Feb 18, 2022 at 10:51 AM Ajin Cherian <itsajin@gmail.com> wrote:
> > > Some comments:
> > Thanks for your review.
> >
> > >  I see you only track skipped Inserts/Updates and Deletes. What about
> > > DDL operations that are skipped, what about truncate.
> > > What about changes made to unpublished tables? I wonder if you could
> > > create a test script that only did DDL operations
> > > and truncates, would this timeout happen?
> > According to your suggestion, I tested with DDL and truncate.
> > While testing, I ran only 20,000 DDLs and 10,000 truncations in one
> > transaction.
> > If I set wal_sender_timeout and wal_receiver_timeout to 30s, it will time out.
> > And if I use the default values, it will not time out.
> > IMHO there should not be long transactions that only contain DDL and
> > truncation. I'm not quite sure, do we need to handle this kind of use case?
> >
> 
> I think it is better to handle such cases as well and changes related
> to unpublished tables as well. BTW, it seems Kuroda-San has also given
> some comments [1] which I am not sure are addressed.
Add handling of related use cases.

> I think instead of keeping the skipping threshold w.r.t
> wal_sender_timeout, we can use some conservative number like 10000 or
> so which we are sure won't impact performance and won't lead to
> timeouts.
Yes, it would be better. Set the threshold conservatively to 10000.

> *
> + /*
> + * skipped_changes_count is reset when processing changes that do not need
> to
> + * be skipped.
> + */
> + skipped_changes_count = 0
> 
> When the skipped_changes_count is reset, the sendTime should also be
> reset. Can we reset it whenever the UpdateProgress function is called
> with send_keep_alive as false?
Fixed.

Attached is a new patch that addresses the following improvements from the
comments I have received so far:
1. Consider other changes that need to be skipped (truncate, DDL and calls to
the function pg_logical_emit_message). [suggestion by Ajin, Amit]
(A new function SendKeepaliveIfNecessary is added for trying to send a keepalive message.)
2. Set the threshold conservatively to a static value of 10000. [suggestion by Amit, Kuroda-San]
3. Reset sendTime in function WalSndUpdateProgress when send_keep_alive is
false. [suggestion by Amit]

Regards,
Wang wei

Attachments

RE: Logical replication timeout problem

From
"wangw.fnst@fujitsu.com"
Date:
On Thur, Feb 24, 2022 at 4:06 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> Dear Wang,
Thanks for your review.

> > According to our discussion, we need to send keepalive messages to
> > subscriber when skipping changes.
> > One approach is that **for each skipped change**, we try to send
> > keepalive message by calculating whether a timeout will occur based on
> > the current time and the last time the keepalive was sent. But this will brings
> slight overhead.
> > So I want to try another approach: after **constantly skipping some
> > changes**, we try to send keepalive message by calculating whether a
> > timeout will occur based on the current time and the last time the keepalive
> was sent.
> 
> You meant that calling system calls like GetCurrentTimestamp() should be
> reduced, right? I'm not sure how it affects but it seems reasonable.
Yes. There is no need to invoke it frequently, and doing so would bring overhead.

> > IMO, we should send keepalive message after skipping a certain number
> > of changes constantly.
> > And I want to calculate the threshold dynamically by using a fixed
> > value to avoid adding too much code.
> > In addition, different users have different machine performance, and
> > users can modify wal_sender_timeout, so the threshold should be
> > dynamically calculated according to wal_sender_timeout.
> 
> Your experiment seems not bad, but the background cannot be understand
> from code comments. I prefer a static threshold because it's more simple, which
> as Amit said in the following too:
> 
> https://www.postgresql.org/message-id/CAA4eK1%2B-
> p_K_j%3DNiGGD6tCYXiJH0ypT4REX5PBKJ4AcUoF3gZQ%40mail.gmail.com
Yes, you are right. Fixed.
And I set the threshold to 10000.

> BTW, this patch cannot be applied to current master.
Thanks for the reminder. I have rebased it.
Kindly have a look at the new patch shared in [1].

[1]
https://www.postgresql.org/message-id/OS3PR01MB6275FEB9F83081F1C87539B99E019%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

RE: Logical replication timeout problem

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Wang,

> Attached a new patch that addresses following improvements I have got so far as
> comments:
> 1. Consider other changes that need to be skipped(truncate, DDL and function
> calls pg_logical_emit_message). [suggestion by Ajin, Amit]
> (Add a new function SendKeepaliveIfNecessary for trying to send keepalive
> message.)
> 2. Set the threshold conservatively to a static value of 10000.[suggestion by Amit,
> Kuroda-San]
> 3. Reset sendTime in function WalSndUpdateProgress when send_keep_alive is
> false. [suggestion by Amit]

Thank you for the good patch! I'll check it in more detail later,
but it applies on top of my code and passes check-world.
I have some minor comments:

```
+ * Try to send keepalive message
```

Maybe missing "a"?

```
+       /*
+       * After continuously skipping SKIPPED_CHANGES_THRESHOLD changes, try to send a
+       * keepalive message.
+       */
```

This comment does not follow the preferred style:
https://www.postgresql.org/docs/devel/source-format.html

```
@@ -683,12 +683,12 @@ OutputPluginWrite(struct LogicalDecodingContext *ctx, bool last_write)
  * Update progress tracking (if supported).
  */
 void
-OutputPluginUpdateProgress(struct LogicalDecodingContext *ctx)
+OutputPluginUpdateProgress(struct LogicalDecodingContext *ctx, bool send_keep_alive)
```

This function is no longer doing just tracking.
Could you update the code comment above?

```
    if (!is_publishable_relation(relation))
        return;
```

I'm not sure, but it seems that the function exits immediately if the relation
is a sequence, view, temporary table and so on. Is that OK? Can that never happen?

```
+       SendKeepaliveIfNecessary(ctx, false);
```

I think a comment is needed above which clarifies sending a keepalive message.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Logical replication timeout problem

From
"wangw.fnst@fujitsu.com"
Date:
On Mon, Feb 28, 2022 at 6:58 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> Dear Wang,
> 
> > Attached a new patch that addresses following improvements I have got
> > so far as
> > comments:
> > 1. Consider other changes that need to be skipped(truncate, DDL and
> > function calls pg_logical_emit_message). [suggestion by Ajin, Amit]
> > (Add a new function SendKeepaliveIfNecessary for trying to send
> > keepalive
> > message.)
> > 2. Set the threshold conservatively to a static value of
> > 10000.[suggestion by Amit, Kuroda-San] 3. Reset sendTime in function
> > WalSndUpdateProgress when send_keep_alive is false. [suggestion by
> > Amit]
> 
> Thank you for giving a good patch! I'll check more detail later, but it can be
> applied my codes and passed check world.
> I put some minor comments:
Thanks for your comments.

> ```
> + * Try to send keepalive message
> ```
> 
> Maybe missing "a"?
Fixed. Add missing "a".

> ```
> +       /*
> +       * After continuously skipping SKIPPED_CHANGES_THRESHOLD changes, try
> to send a
> +       * keepalive message.
> +       */
> ```
> 
> This comments does not follow preferred style:
> https://www.postgresql.org/docs/devel/source-format.html
Fixed. Correct wrong comment style.

> ```
> @@ -683,12 +683,12 @@ OutputPluginWrite(struct LogicalDecodingContext *ctx,
> bool last_write)
>   * Update progress tracking (if supported).
>   */
>  void
> -OutputPluginUpdateProgress(struct LogicalDecodingContext *ctx)
> +OutputPluginUpdateProgress(struct LogicalDecodingContext *ctx, bool
> +send_keep_alive)
> ```
> 
> This function is no longer doing just tracking.
> Could you update the code comment above?
Fixed. Update the comment above function OutputPluginUpdateProgress.

> ```
>     if (!is_publishable_relation(relation))
>         return;
> ```
> 
> I'm not sure but it seems that the function exits immediately if relation is a
> sequence, view, temporary table and so on. Is it OK? Does it never happen?
I did some checks to confirm this. There are indeed several situations that can
cause a timeout. For example, if I insert a lot of data into the table
sql_features in a long transaction, the subscriber side will time out.
Although I think users should not modify these tables arbitrarily, it could
happen. To be conservative, I think this use case should be addressed as well.
Fixed. Invoke function SendKeepaliveIfNecessary before the return.
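For illustration, the shape of that change in pgoutput_change would be roughly
the following (SendKeepaliveIfNecessary is the helper added by this patch; the
exact arguments here follow the fragment quoted in the review below and are
otherwise an assumption):

	if (!is_publishable_relation(relation))
	{
		/*
		 * Changes to non-publishable relations (e.g. initdb-created tables
		 * such as sql_features) are skipped work too, so give the keepalive
		 * logic a chance before bailing out.
		 */
		SendKeepaliveIfNecessary(ctx, false);
		return;
	}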

> ```
> +       SendKeepaliveIfNecessary(ctx, false);
> ```
> 
> I think a comment is needed above which clarifies sending a keepalive message.
Fixed. Before invoking function SendKeepaliveIfNecessary, add the corresponding
comment.

Attached is the new patch. [suggestions by Kuroda-San]
1. Fix the typo.
2. Improve the comment style.
3. Fix the missing consideration.
4. Add comments to clarify the above functions and function calls.

Regards,
Wang wei

Attachments

Re: Logical replication timeout problem

From
Peter Smith
Date:
On Wed, Mar 2, 2022 at 1:06 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
...
> Attach the new patch. [suggestion by Kuroda-San]

It is difficult to read the thread and to keep track of who reviewed
what, and what patch is latest etc, when every patch name is the same.

Can you please introduce a version number for future patch attachments?

------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Logical replication timeout problem

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Wang,

> Attach the new patch. [suggestion by Kuroda-San]
> 1. Fix the typo.
> 2. Improve comment style.
> 3. Fix missing consideration.
> 4. Add comments to clarifies above functions and function calls.

Thank you for updating the patch! I confirmed they were fixed.

```
                                case REORDER_BUFFER_CHANGE_INVALIDATION:
-                                       /* Execute the invalidation messages locally */
-                                       ReorderBufferExecuteInvalidations(
-                                                                         change->data.inval.ninvalidations,
-                                                                         change->data.inval.invalidations);
-                                       break;
+                                       {
+                                               LogicalDecodingContext *ctx = rb->private_data;
+
+                                               Assert(!ctx->fast_forward);
+
+                                               /* Set output state. */
+                                               ctx->accept_writes = true;
+                                               ctx->write_xid = txn->xid;
+                                               ctx->write_location = change->lsn;
```

Some codes were added in ReorderBufferProcessTXN() for treating DDL, 




I'm also happy if you give the version number :-).


Best Regards,
Hayato Kuroda
FUJITSU LIMITED

> -----Original Message-----
> From: Wang, Wei/王 威 <wangw.fnst@fujitsu.com>
> Sent: Wednesday, March 2, 2022 11:06 AM
> To: Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com>
> Cc: Fabrice Chapuis <fabrice636861@gmail.com>; Simon Riggs
> <simon.riggs@enterprisedb.com>; Petr Jelinek
> <petr.jelinek@enterprisedb.com>; Tang, Haiying/唐 海英
> <tanghy.fnst@fujitsu.com>; Amit Kapila <amit.kapila16@gmail.com>;
> PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>; Ajin Cherian
> <itsajin@gmail.com>
> Subject: RE: Logical replication timeout problem
> 
> On Mon, Feb 28, 2022 at 6:58 PM Kuroda, Hayato/黒田 隼人
> <kuroda.hayato@fujitsu.com> wrote:
> > Dear Wang,
> >
> > > Attached a new patch that addresses following improvements I have got
> > > so far as
> > > comments:
> > > 1. Consider other changes that need to be skipped(truncate, DDL and
> > > function calls pg_logical_emit_message). [suggestion by Ajin, Amit]
> > > (Add a new function SendKeepaliveIfNecessary for trying to send
> > > keepalive
> > > message.)
> > > 2. Set the threshold conservatively to a static value of
> > > 10000.[suggestion by Amit, Kuroda-San] 3. Reset sendTime in function
> > > WalSndUpdateProgress when send_keep_alive is false. [suggestion by
> > > Amit]
> >
> > Thank you for giving a good patch! I'll check more detail later, but it can be
> > applied my codes and passed check world.
> > I put some minor comments:
> Thanks for your comments.
> 
> > ```
> > + * Try to send keepalive message
> > ```
> >
> > Maybe missing "a"?
> Fixed. Add missing "a".
> 
> > ```
> > +       /*
> > +       * After continuously skipping SKIPPED_CHANGES_THRESHOLD
> changes, try
> > to send a
> > +       * keepalive message.
> > +       */
> > ```
> >
> > This comments does not follow preferred style:
> > https://www.postgresql.org/docs/devel/source-format.html
> Fixed. Correct wrong comment style.
> 
> > ```
> > @@ -683,12 +683,12 @@ OutputPluginWrite(struct LogicalDecodingContext
> *ctx,
> > bool last_write)
> >   * Update progress tracking (if supported).
> >   */
> >  void
> > -OutputPluginUpdateProgress(struct LogicalDecodingContext *ctx)
> > +OutputPluginUpdateProgress(struct LogicalDecodingContext *ctx, bool
> > +send_keep_alive)
> > ```
> >
> > This function is no longer doing just tracking.
> > Could you update the code comment above?
> Fixed. Update the comment above function OutputPluginUpdateProgress.
> 
> > ```
> >     if (!is_publishable_relation(relation))
> >         return;
> > ```
> >
> > I'm not sure but it seems that the function exits immediately if relation is a
> > sequence, view, temporary table and so on. Is it OK? Does it never happen?
> I did some checks to confirm this. After my confirmation, there are several
> situations that can cause a timeout. For example, if I insert many date into
> table sql_features in a long transaction, subscriber-side will time out.
> Although I think users should not modify these tables arbitrarily, it could
> happen. To be conservative, I think this use case should be addressed as well.
> Fixed. Invoke function SendKeepaliveIfNecessary before return.
> 
> > ```
> > +       SendKeepaliveIfNecessary(ctx, false);
> > ```
> >
> > I think a comment is needed above which clarifies sending a keepalive
> message.
> Fixed. Before invoking function SendKeepaliveIfNecessary, add the
> corresponding
> comment.
> 
> Attach the new patch. [suggestion by Kuroda-San]
> 1. Fix the typo.
> 2. Improve comment style.
> 3. Fix missing consideration.
> 4. Add comments to clarify the above functions and function calls.
> 
> Regards,
> Wang wei

RE: Logical replication timeout problem

От
"kuroda.hayato@fujitsu.com"
Дата:
Dear Wang,

> Some codes were added in ReorderBufferProcessTXN() for treating DDL,

My mailer went wrong, so I'll put comments again. Sorry.

Some codes were added in ReorderBufferProcessTXN() for treating DDL,
but I doubt that updating accept_writes is needed.
IMU, the parameter is read by OutputPluginPrepareWrite() in order to align messages.
They should have a header - like 'w' - before their body. But here only a keepalive message is sent,
no meaningful changes, so I think it might not be needed.
I commented out the line and tested like you did [1], and no timeouts or errors were found.
Do you have any reasons for that?

https://www.postgresql.org/message-id/OS3PR01MB6275A95FD44DC6C46058EA389E3B9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Fri, Mar 4, 2022 at 4:26 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
>
Thanks for your test and comments.

> Some codes were added in ReorderBufferProcessTXN() for treating DDL,
> but I doubted updating accept_writes is needed.
> IMU, the parameter is read by OutputPluginPrepareWrite() in order align
> messages.
> They should have a header - like 'w' - before their body. But here only a
> keepalive message is sent,
> no meaningful changes, so I think it might be not needed.
> I commented out the line and tested like you did [1], and no timeout and errors
> were found.
> Do you have any reasons for that?
> 
> https://www.postgresql.org/message-
> id/OS3PR01MB6275A95FD44DC6C46058EA389E3B9%40OS3PR01MB6275.jpnprd0
> 1.prod.outlook.com
Yes, you are right. We should not set accept_writes to true here.
And I looked into the function WalSndUpdateProgress. I found that function
WalSndUpdateProgress tries to record the time at which some messages (via function
LagTrackerWrite) are sent to the subscriber, such as in function pgoutput_commit_txn.
Then, when the publisher receives the reply message from the subscriber (function
ProcessStandbyReplyMessage), the publisher invokes LagTrackerRead to calculate the
delay time (refer to the view pg_stat_replication).
Given the purpose of LagTrackerWrite, I think there is no need to log the time
when sending keepalive messages here.
So when the parameter send_keep_alive of function WalSndUpdateProgress is true,
the time recording is skipped.
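
To illustrate the idea, here is a rough sketch of the shape of the change (this
is only an illustration, not the attached patch itself; the keepalive part is
omitted and simplified):

```
static void
WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn,
                     TransactionId xid, bool send_keep_alive)
{
    if (send_keep_alive)
    {
        /*
         * Changes were skipped: no data message for this LSN goes to the
         * subscriber, so do not feed the lag tracker.  Only try to keep
         * the connection alive here (details omitted in this sketch).
         */
        return;
    }

    /*
     * A real message was sent: remember when this LSN left the walsender
     * so that LagTrackerRead() can compute the lag for pg_stat_replication
     * once the subscriber's reply arrives.
     */
    LagTrackerWrite(lsn, GetCurrentTimestamp());
}
```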

> I'm also happy if you give the version number :-).
Introduce version information, starting from version 1.

Attach the new patch.
1. Fix wrong variable setting and skip unnecessary time records.[suggestion by Kuroda-San and me.]
2. Introduce version information.[suggestion by Peter, Kuroda-San]

Regards,
Wang wei

Вложения

Re: Logical replication timeout problem

От
Ajin Cherian
Дата:
On Tue, Mar 8, 2022 at 12:25 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
> Attach the new patch.
> 1. Fix wrong variable setting and skip unnecessary time records.[suggestion by Kuroda-San and me.]
> 2. Introduce version information.[suggestion by Peter, Kuroda-San]
>
> Regards,
> Wang wei

Some comments.

1. The comment  on top of SendKeepaliveIfNecessary

 Try to send a keepalive message if too many changes was skipped.

change to

Try to send a keepalive message if too many changes were skipped.

2. In pgoutput_change:

+ /* Reset the counter for skipped changes. */
+ SendKeepaliveIfNecessary(ctx, false);
+

This reset is called too early; this function might go on to skip
changes because of the row filter, so this
reset fits better once we know for sure that a change is sent out. You
will also need to send a keepalive
when the change is skipped due to the row filter.

regards,
Ajin Cherian
Fujitsu Australia



Re: Logical replication timeout problem

От
Masahiko Sawada
Дата:
Hi,

On Tue, Mar 8, 2022 at 10:25 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Fri, Mar 4, 2022 at 4:26 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> >
> Thanks for your test and comments.
>
> > Some codes were added in ReorderBufferProcessTXN() for treating DDL,
> > but I doubted updating accept_writes is needed.
> > IMU, the parameter is read by OutputPluginPrepareWrite() in order align
> > messages.
> > They should have a header - like 'w' - before their body. But here only a
> > keepalive message is sent,
> > no meaningful changes, so I think it might be not needed.
> > I commented out the line and tested like you did [1], and no timeout and errors
> > were found.
> > Do you have any reasons for that?
> >
> > https://www.postgresql.org/message-
> > id/OS3PR01MB6275A95FD44DC6C46058EA389E3B9%40OS3PR01MB6275.jpnprd0
> > 1.prod.outlook.com
> Yes, you are right. We should not set accept_writes to true here.
> And I looked into the function WalSndUpdateProgress. I found function
> WalSndUpdateProgress try to record the time of some message(by function
> LagTrackerWrite) sent to subscriber, such as in function pgoutput_commit_txn.
> Then, when publisher receives the reply message from the subscriber(function
> ProcessStandbyReplyMessage), publisher invokes LagTrackerRead to calculate the
> delay time(refer to view pg_stat_replication).
> Referring to the purpose of LagTrackerWrite, I think it is no need to log time
> when sending keepalive messages here.
> So when the parameter send_keep_alive of function WalSndUpdateProgress is true,
> skip the recording time.
>
> > I'm also happy if you give the version number :-).
> Introduce version information, starting from version 1.
>
> Attach the new patch.
> 1. Fix wrong variable setting and skip unnecessary time records.[suggestion by Kuroda-San and me.]
> 2. Introduce version information.[suggestion by Peter, Kuroda-San]

I've looked at the patch and have a question:

+void
+SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped)
+{
+        static int skipped_changes_count = 0;
+
+        /*
+         * skipped_changes_count is reset when processing changes that do not
+         * need to be skipped.
+         */
+        if (!skipped)
+        {
+                skipped_changes_count = 0;
+                return;
+        }
+
+        /*
+         * After continuously skipping SKIPPED_CHANGES_THRESHOLD
changes, try to send a
+         * keepalive message.
+         */
+        #define SKIPPED_CHANGES_THRESHOLD 10000
+
+        if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD)
+        {
+                /* Try to send a keepalive message. */
+                OutputPluginUpdateProgress(ctx, true);
+
+                /* After trying to send a keepalive message, reset the flag. */
+                skipped_changes_count = 0;
+        }
+}

Since we send a keepalive after continuously skipping 10000 changes,
the originally reported issue can still occur if skipping 10000
changes took more than the timeout and the walsender didn't send any
change while that, is that right?

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Logical replication timeout problem

От
"kuroda.hayato@fujitsu.com"
Дата:
Dear Wang,

Thank you for updating the patch! Good self-reviewing.

> And I looked into the function WalSndUpdateProgress. I found function
> WalSndUpdateProgress try to record the time of some message(by function
> LagTrackerWrite) sent to subscriber, such as in function pgoutput_commit_txn.

Yeah, I think you are right.

> Then, when publisher receives the reply message from the subscriber(function
> ProcessStandbyReplyMessage), publisher invokes LagTrackerRead to calculate
> the
> delay time(refer to view pg_stat_replication).
> Referring to the purpose of LagTrackerWrite, I think it is no need to log time
> when sending keepalive messages here.
> So when the parameter send_keep_alive of function WalSndUpdateProgress is
> true,
> skip the recording time.

I also read them. LagTracker records the elapsed time between sending commit
from publisher and receiving reply from subscriber, right? It seems good.

Do we need to add a test for them? I think it can be added to 100_bugs.pl.
Actually I tried to write a PoC, but I have not finished implementing it.
I'll send it if it is done.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Tues, Mar 8, 2022 at 11:54 PM Ajin Cherian <itsajin@gmail.com> wrote:
> Some comments.
Thanks for your comments.

> 1. The comment  on top of SendKeepaliveIfNecessary
> 
>  Try to send a keepalive message if too many changes was skipped.
> 
> change to
> 
> Try to send a keepalive message if too many changes were skipped.
Fixed. Change 'was' to 'were'.

> 2. In pgoutput_change:
> 
> + /* Reset the counter for skipped changes. */
> + SendKeepaliveIfNecessary(ctx, false);
> +
> 
> This reset is called too early, this function might go on to skip
> changes because of the row filter, so this
> reset fits better once we know for sure that a change is sent out. You
> will also need to send keep alive
> when the change is skipped due to the row filter.
Fixed. Add a flag 'is_send' to record whether the change is sent, then reset
the counter or try to send a keepalive message based on the flag 'is_send'.
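
In other words, the intended flow in pgoutput_change() is roughly the following
(a simplified sketch only; the relation lookup, the row-filter evaluation and
the actual write are elided):

```
static void
pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                Relation relation, ReorderBufferChange *change)
{
    bool        is_send = false;

    if (!is_publishable_relation(relation))
    {
        /* Skipped change: bump the skip counter, maybe send a keepalive. */
        SendKeepaliveIfNecessary(ctx, true);
        return;
    }

    /*
     * ... the row filter and the actual OutputPluginPrepareWrite() /
     * OutputPluginWrite() calls are elided here; is_send becomes true only
     * when the change has really been written out ...
     */

    if (is_send)
        SendKeepaliveIfNecessary(ctx, false);   /* reset the skip counter */
    else
        SendKeepaliveIfNecessary(ctx, true);    /* filtered out: count it */
}
```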

Attach the new patch.
1. Fix typo in comment on top of SendKeepaliveIfNecessary.[suggestion by Ajin.]
2. Add handling of cases filtered out by row filter.[suggestion by Ajin.]

Regards,
Wang wei

Вложения

RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Tue, Mar 8, 2022 at 3:52 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I've looked at the patch and have a question:
Thanks for your review and comments.

> +void
> +SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped) {
> +        static int skipped_changes_count = 0;
> +
> +        /*
> +         * skipped_changes_count is reset when processing changes that do not
> +         * need to be skipped.
> +         */
> +        if (!skipped)
> +        {
> +                skipped_changes_count = 0;
> +                return;
> +        }
> +
> +        /*
> +         * After continuously skipping SKIPPED_CHANGES_THRESHOLD
> changes, try to send a
> +         * keepalive message.
> +         */
> +        #define SKIPPED_CHANGES_THRESHOLD 10000
> +
> +        if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD)
> +        {
> +                /* Try to send a keepalive message. */
> +                OutputPluginUpdateProgress(ctx, true);
> +
> +                /* After trying to send a keepalive message, reset the flag. */
> +                skipped_changes_count = 0;
> +        }
> +}
> 
> Since we send a keepalive after continuously skipping 10000 changes, the
> originally reported issue can still occur if skipping 10000 changes took more than
> the timeout and the walsender didn't send any change while that, is that right?
Yes, theoretically so.
But after testing, I think this value should be conservative enough not to reproduce
this bug.
After the previous discussion[1], it is currently considered that it is better
to directly set a conservative threshold than to calculate the threshold based
on wal_sender_timeout.

[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275FEB9F83081F1C87539B99E019%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Tue, Mar 8, 2022 at 4:48 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> Thank you for updating the patch! Good self-reviewing.
Thanks for your comments.

> > And I looked into the function WalSndUpdateProgress. I found function
> > WalSndUpdateProgress try to record the time of some message(by
> > function
> > LagTrackerWrite) sent to subscriber, such as in function pgoutput_commit_txn.
> 
> Yeah, I think you are right.
> 
> > Then, when publisher receives the reply message from the
> > subscriber(function ProcessStandbyReplyMessage), publisher invokes
> > LagTrackerRead to calculate the delay time(refer to view
> > pg_stat_replication).
> > Referring to the purpose of LagTrackerWrite, I think it is no need to
> > log time when sending keepalive messages here.
> > So when the parameter send_keep_alive of function WalSndUpdateProgress
> > is true, skip the recording time.
> 
> I also read them. LagTracker records the elapsed time between sending commit
> from publisher and receiving reply from subscriber, right? It seems good.
Yes.

> Do we need adding a test for them? I think it can be added to 100_bugs.pl.
> Actually I tried to send PoC, but it does not finish to implement that.
> I'll send if it is done.
I'm not sure if it is worth it.
Because a test reproducing this bug might take some time and might risk
making the build farm slow, I am not sure if others would like to have
such a test.

Regards,
Wang wei

RE: Logical replication timeout problem

От
"kuroda.hayato@fujitsu.com"
Дата:
Dear Wang,

Thank you for updating!

> > Do we need adding a test for them? I think it can be added to 100_bugs.pl.
> > Actually I tried to send PoC, but it does not finish to implement that.
> > I'll send if it is done.
> I'm not sure if it is worth it.
> Because the reproduced test of this bug might take some time and might risk
> making the build farm slow, so I am not sure if others would like the
> reproduced test of this bug.

I understand from your reply that it may be difficult to stabilize and
minimize such a test. I withdraw the suggestion above.
I put some comments for v2, mainly cosmetic ones.

1. pgoutput_change
```
+       bool is_send = true;
```

My first impression is that is_send should be initialized to false,
and it will change to true when OutputPluginWrite() is called.


2. pgoutput_change
```
+                               {
+                                       is_send = false;
+                                       break;
+                               }
```

Here are too many indents, but I think they should be removed.
See above comment.

3. WalSndUpdateProgress
```
+               /*
+                * If half of wal_sender_timeout has lapsed without send message standby,
+                * send a keep-alive message to the standby.
+                */
```

The comment seems inconsistent with the others:
here it is "keep-alive", but other parts use "keepalive".

4. ReorderBufferProcessTXN
```
+
change->data.inval.ninvalidations,
+
change->data.inval.invalidations);
```

Maybe these lines break 80-columns rule.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Logical replication timeout problem

От
Masahiko Sawada
Дата:
On Wed, Mar 9, 2022 at 11:26 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Tue, Mar 8, 2022 at 3:52 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I've looked at the patch and have a question:
> Thanks for your review and comments.
>
> > +void
> > +SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped) {
> > +        static int skipped_changes_count = 0;
> > +
> > +        /*
> > +         * skipped_changes_count is reset when processing changes that do not
> > +         * need to be skipped.
> > +         */
> > +        if (!skipped)
> > +        {
> > +                skipped_changes_count = 0;
> > +                return;
> > +        }
> > +
> > +        /*
> > +         * After continuously skipping SKIPPED_CHANGES_THRESHOLD
> > changes, try to send a
> > +         * keepalive message.
> > +         */
> > +        #define SKIPPED_CHANGES_THRESHOLD 10000
> > +
> > +        if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD)
> > +        {
> > +                /* Try to send a keepalive message. */
> > +                OutputPluginUpdateProgress(ctx, true);
> > +
> > +                /* After trying to send a keepalive message, reset the flag. */
> > +                skipped_changes_count = 0;
> > +        }
> > +}
> >
> > Since we send a keepalive after continuously skipping 10000 changes, the
> > originally reported issue can still occur if skipping 10000 changes took more than
> > the timeout and the walsender didn't send any change while that, is that right?
> Yes, theoretically so.
> But after testing, I think this value should be conservative enough not to reproduce
> this bug.

But it really depends on the workload, the server condition, and the
timeout value, right? The logical decoding might involve disk I/O much
to spill/load intermediate data and the system might be under the
high-load condition. Why don't we check both the count and the time?
That is, I think we can send a keep-alive either if we skipped 10000
changes or if we didn't sent anything for wal_sender_timeout / 2.

Also, the patch changes the current behavior of wal senders; with the
patch, we send keep-alive messages even when wal_sender_timeout = 0.
But I'm not sure it's a good idea. The subscriber's
wal_receiver_timeout might be lower than wal_sender_timeout. Instead,
I think it's better to periodically check replies and send a reply to
the keep-alive message sent from the subscriber if necessary, for
example, every 10000 skipped changes.
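
Putting both points together, something like the following is what I have in
mind, just as a sketch (the variable names are made up and where exactly the
check should live is an open question):

```
#define SKIPPED_CHANGES_THRESHOLD 10000

static int         skipped_changes_count = 0;
static TimestampTz last_keepalive_check = 0;    /* hypothetical */

/* Called for every change that we skip. */
static void
keepalive_if_necessary(LogicalDecodingContext *ctx)
{
    bool        timed_out = false;

    /* Time-based part; probably should be disabled if wal_sender_timeout = 0. */
    if (wal_sender_timeout > 0)
        timed_out = TimestampDifferenceExceeds(last_keepalive_check,
                                               GetCurrentTimestamp(),
                                               wal_sender_timeout / 2);

    if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD || timed_out)
    {
        OutputPluginUpdateProgress(ctx, true);  /* may send a keepalive */
        skipped_changes_count = 0;
        last_keepalive_check = GetCurrentTimestamp();
    }
}
```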

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Logical replication timeout problem

От
Björn Harrtell
Дата:
Hi, I have been following this discussion for a while because I believe we are hit by this pretty hard.

This sounds very reasonable to me:

"Why don't we check both the count and the time?
That is, I think we can send a keep-alive either if we skipped 10000
changes or if we didn't sent anything for wal_sender_timeout / 2"

Will gladly test what ends up as an acceptable patch for this, hoping for the best and thanks for looking into this.

Den ons 9 mars 2022 kl 07:45 skrev Masahiko Sawada <sawada.mshk@gmail.com>:
On Wed, Mar 9, 2022 at 11:26 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Tue, Mar 8, 2022 at 3:52 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I've looked at the patch and have a question:
> Thanks for your review and comments.
>
> > +void
> > +SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped) {
> > +        static int skipped_changes_count = 0;
> > +
> > +        /*
> > +         * skipped_changes_count is reset when processing changes that do not
> > +         * need to be skipped.
> > +         */
> > +        if (!skipped)
> > +        {
> > +                skipped_changes_count = 0;
> > +                return;
> > +        }
> > +
> > +        /*
> > +         * After continuously skipping SKIPPED_CHANGES_THRESHOLD
> > changes, try to send a
> > +         * keepalive message.
> > +         */
> > +        #define SKIPPED_CHANGES_THRESHOLD 10000
> > +
> > +        if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD)
> > +        {
> > +                /* Try to send a keepalive message. */
> > +                OutputPluginUpdateProgress(ctx, true);
> > +
> > +                /* After trying to send a keepalive message, reset the flag. */
> > +                skipped_changes_count = 0;
> > +        }
> > +}
> >
> > Since we send a keepalive after continuously skipping 10000 changes, the
> > originally reported issue can still occur if skipping 10000 changes took more than
> > the timeout and the walsender didn't send any change while that, is that right?
> Yes, theoretically so.
> But after testing, I think this value should be conservative enough not to reproduce
> this bug.

But it really depends on the workload, the server condition, and the
timeout value, right? The logical decoding might involve disk I/O much
to spill/load intermediate data and the system might be under the
high-load condition. Why don't we check both the count and the time?
That is, I think we can send a keep-alive either if we skipped 10000
changes or if we didn't sent anything for wal_sender_timeout / 2.

Also, the patch changes the current behavior of wal senders; with the
patch, we send keep-alive messages even when wal_sender_timeout = 0.
But I'm not sure it's a good idea. The subscriber's
wal_receiver_timeout might be lower than wal_sender_timeout. Instead,
I think it's better to periodically check replies and send a reply to
the keep-alive message sent from the subscriber if necessary, for
example, every 10000 skipped changes.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Wed, Mar 9, 2022 at 2:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
Thanks for your comments.

> On Wed, Mar 9, 2022 at 10:26 AM I wrote:
> > On Tue, Mar 8, 2022 at 3:52 PM Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
> > > I've looked at the patch and have a question:
> > Thanks for your review and comments.
> >
> > > +void
> > > +SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped) {
> > > +        static int skipped_changes_count = 0;
> > > +
> > > +        /*
> > > +         * skipped_changes_count is reset when processing changes that do
> not
> > > +         * need to be skipped.
> > > +         */
> > > +        if (!skipped)
> > > +        {
> > > +                skipped_changes_count = 0;
> > > +                return;
> > > +        }
> > > +
> > > +        /*
> > > +         * After continuously skipping SKIPPED_CHANGES_THRESHOLD
> > > changes, try to send a
> > > +         * keepalive message.
> > > +         */
> > > +        #define SKIPPED_CHANGES_THRESHOLD 10000
> > > +
> > > +        if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD)
> > > +        {
> > > +                /* Try to send a keepalive message. */
> > > +                OutputPluginUpdateProgress(ctx, true);
> > > +
> > > +                /* After trying to send a keepalive message, reset the flag. */
> > > +                skipped_changes_count = 0;
> > > +        }
> > > +}
> > >
> > > Since we send a keepalive after continuously skipping 10000 changes, the
> > > originally reported issue can still occur if skipping 10000 changes took more
> than
> > > the timeout and the walsender didn't send any change while that, is that
> right?
> > Yes, theoretically so.
> > But after testing, I think this value should be conservative enough not to
> reproduce
> > this bug.
> 
> But it really depends on the workload, the server condition, and the
> timeout value, right? The logical decoding might involve disk I/O much
> to spill/load intermediate data and the system might be under the
> high-load condition. Why don't we check both the count and the time?
> That is, I think we can send a keep-alive either if we skipped 10000
> changes or if we didn't sent anything for wal_sender_timeout / 2.
Yes, you are right.
Do you mean that when skipping every change, check if it has been more than
(wal_sender_timeout / 2) without sending anything?
IIUC, I tried to send keep-alive messages based on time before[1], but after
testing, I found that it will brings slight overhead. So I am not sure, in a
function(pgoutput_change) that is invoked frequently, should this kind of
overhead be introduced?

> Also, the patch changes the current behavior of wal senders; with the
> patch, we send keep-alive messages even when wal_sender_timeout = 0.
> But I'm not sure it's a good idea. The subscriber's
> wal_receiver_timeout might be lower than wal_sender_timeout. Instead,
> I think it's better to periodically check replies and send a reply to
> the keep-alive message sent from the subscriber if necessary, for
> example, every 10000 skipped changes.
Sorry, I could not follow what you said. I am not sure, do you mean the
following?
1. When we didn't sent anything for (wal_sender_timeout / 2) or we skipped
10000 changes continuously, we will invoke the function WalSndKeepalive in the
function WalSndUpdateProgress, and send a keepalive message to the subscriber
with requesting an immediate reply.
2. If after sending a keepalive message, and then 10000 changes are skipped
continuously again. In this case, we need to handle the reply from the
subscriber-side when processing the 10000th change. The handling approach is to
reply to the confirmation message from the subscriber.

[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275DFFDAC7A59FA148931529E209%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Please let me know if I understand wrong.

Regards,
Wang wei

Re: Logical replication timeout problem

От
Masahiko Sawada
Дата:
On Wed, Mar 16, 2022 at 11:57 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Wed, Mar 9, 2022 at 2:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> Thanks for your comments.
>
> > On Wed, Mar 9, 2022 at 10:26 AM I wrote:
> > > On Tue, Mar 8, 2022 at 3:52 PM Masahiko Sawada <sawada.mshk@gmail.com>
> > wrote:
> > > > I've looked at the patch and have a question:
> > > Thanks for your review and comments.
> > >
> > > > +void
> > > > +SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped) {
> > > > +        static int skipped_changes_count = 0;
> > > > +
> > > > +        /*
> > > > +         * skipped_changes_count is reset when processing changes that do
> > not
> > > > +         * need to be skipped.
> > > > +         */
> > > > +        if (!skipped)
> > > > +        {
> > > > +                skipped_changes_count = 0;
> > > > +                return;
> > > > +        }
> > > > +
> > > > +        /*
> > > > +         * After continuously skipping SKIPPED_CHANGES_THRESHOLD
> > > > changes, try to send a
> > > > +         * keepalive message.
> > > > +         */
> > > > +        #define SKIPPED_CHANGES_THRESHOLD 10000
> > > > +
> > > > +        if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD)
> > > > +        {
> > > > +                /* Try to send a keepalive message. */
> > > > +                OutputPluginUpdateProgress(ctx, true);
> > > > +
> > > > +                /* After trying to send a keepalive message, reset the flag. */
> > > > +                skipped_changes_count = 0;
> > > > +        }
> > > > +}
> > > >
> > > > Since we send a keepalive after continuously skipping 10000 changes, the
> > > > originally reported issue can still occur if skipping 10000 changes took more
> > than
> > > > the timeout and the walsender didn't send any change while that, is that
> > right?
> > > Yes, theoretically so.
> > > But after testing, I think this value should be conservative enough not to
> > reproduce
> > > this bug.
> >
> > But it really depends on the workload, the server condition, and the
> > timeout value, right? The logical decoding might involve disk I/O much
> > to spill/load intermediate data and the system might be under the
> > high-load condition. Why don't we check both the count and the time?
> > That is, I think we can send a keep-alive either if we skipped 10000
> > changes or if we didn't sent anything for wal_sender_timeout / 2.
> Yes, you are right.
> Do you mean that when skipping every change, check if it has been more than
> (wal_sender_timeout / 2) without sending anything?
> IIUC, I tried to send keep-alive messages based on time before[1], but after
> testing, I found that it will brings slight overhead. So I am not sure, in a
> function(pgoutput_change) that is invoked frequently, should this kind of
> overhead be introduced?
>
> > Also, the patch changes the current behavior of wal senders; with the
> > patch, we send keep-alive messages even when wal_sender_timeout = 0.
> > But I'm not sure it's a good idea. The subscriber's
> > wal_receiver_timeout might be lower than wal_sender_timeout. Instead,
> > I think it's better to periodically check replies and send a reply to
> > the keep-alive message sent from the subscriber if necessary, for
> > example, every 10000 skipped changes.
> Sorry, I could not follow what you said. I am not sure, do you mean the
> following?
> 1. When we didn't sent anything for (wal_sender_timeout / 2) or we skipped
> 10000 changes continuously, we will invoke the function WalSndKeepalive in the
> function WalSndUpdateProgress, and send a keepalive message to the subscriber
> with requesting an immediate reply.
> 2. If after sending a keepalive message, and then 10000 changes are skipped
> continuously again. In this case, we need to handle the reply from the
> subscriber-side when processing the 10000th change. The handling approach is to
> reply to the confirmation message from the subscriber.

After more thought, can we check only wal_sender_timeout without
skip-count? That is, in WalSndUpdateProgress(), if we have received
any reply from the subscriber in last (wal_sender_timeout / 2), we
don't need to do anything in terms of keep-alive. If not, we do
ProcessRepliesIfAny() (and probably WalSndCheckTimeOut()?) then
WalSndKeepalivesIfNecessary(). That way, we can send keep-alive
messages every (wal_sender_timeout / 2). And since we don't call them
for every change, we would not need to worry about the overhead much.
Actually, WalSndWriteData() does similar things; even in the case
where we don't skip consecutive changes (i.e., sending consecutive
changes to the subscriber), we do ProcessRepliesIfAny() at least every
(wal_sender_timeout / 2). I think this would work in most common cases
where the user sets both wal_sender_timeout and wal_receiver_timeout
to the same value.
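
In code, the shape I have in mind is roughly the following (just a sketch; the
keepalive sending itself and the WalSndCheckTimeOut() handling are omitted):

```
static void
WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn,
                     TransactionId xid)
{
    TimestampTz now = GetCurrentTimestamp();

    /* Fast path: we heard from the subscriber recently enough, do nothing. */
    if (now < TimestampTzPlusMilliseconds(last_reply_timestamp,
                                          wal_sender_timeout / 2))
        return;

    /* Otherwise catch up on replies and keep the connection alive. */
    ProcessRepliesIfAny();
    /* ... WalSndCheckTimeOut() and a keepalive if still nothing received ... */
}
```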

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Wed, Mar 16, 2022 at 7:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Mar 16, 2022 at 11:57 AM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> > > But it really depends on the workload, the server condition, and the
> > > timeout value, right? The logical decoding might involve disk I/O much
> > > to spill/load intermediate data and the system might be under the
> > > high-load condition. Why don't we check both the count and the time?
> > > That is, I think we can send a keep-alive either if we skipped 10000
> > > changes or if we didn't sent anything for wal_sender_timeout / 2.
> > Yes, you are right.
> > Do you mean that when skipping every change, check if it has been more than
> > (wal_sender_timeout / 2) without sending anything?
> > IIUC, I tried to send keep-alive messages based on time before[1], but after
> > testing, I found that it will brings slight overhead. So I am not sure, in a
> > function(pgoutput_change) that is invoked frequently, should this kind of
> > overhead be introduced?
> >
> > > Also, the patch changes the current behavior of wal senders; with the
> > > patch, we send keep-alive messages even when wal_sender_timeout = 0.
> > > But I'm not sure it's a good idea. The subscriber's
> > > wal_receiver_timeout might be lower than wal_sender_timeout. Instead,
> > > I think it's better to periodically check replies and send a reply to
> > > the keep-alive message sent from the subscriber if necessary, for
> > > example, every 10000 skipped changes.
> > Sorry, I could not follow what you said. I am not sure, do you mean the
> > following?
> > 1. When we didn't sent anything for (wal_sender_timeout / 2) or we skipped
> > 10000 changes continuously, we will invoke the function WalSndKeepalive in the
> > function WalSndUpdateProgress, and send a keepalive message to the subscriber
> > with requesting an immediate reply.
> > 2. If after sending a keepalive message, and then 10000 changes are skipped
> > continuously again. In this case, we need to handle the reply from the
> > subscriber-side when processing the 10000th change. The handling approach is to
> > reply to the confirmation message from the subscriber.
>
> After more thought, can we check only wal_sender_timeout without
> skip-count? That is, in WalSndUpdateProgress(), if we have received
> any reply from the subscriber in last (wal_sender_timeout / 2), we
> don't need to do anything in terms of keep-alive. If not, we do
> ProcessRepliesIfAny() (and probably WalSndCheckTimeOut()?) then
> WalSndKeepalivesIfNecessary(). That way, we can send keep-alive
> messages every (wal_sender_timeout / 2). And since we don't call them
> for every change, we would not need to worry about the overhead much.
>

But won't that lead to a call to GetCurrentTimestamp() for each change
we skip? IIUC from previous replies that lead to a slight slowdown in
previous tests of Wang-San.

> Actually, WalSndWriteData() does similar things;
>

That also every time seems to be calling GetCurrentTimestamp(). I
think it might be okay when we are sending the change but not sure if
the overhead of the same is negligible when we are skipping the
changes.

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Thu, Mar 17, 2022 at 12:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Mar 16, 2022 at 7:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > After more thought, can we check only wal_sender_timeout without
> > skip-count? That is, in WalSndUpdateProgress(), if we have received
> > any reply from the subscriber in last (wal_sender_timeout / 2), we
> > don't need to do anything in terms of keep-alive. If not, we do
> > ProcessRepliesIfAny() (and probably WalSndCheckTimeOut()?) then
> > WalSndKeepalivesIfNecessary(). That way, we can send keep-alive
> > messages every (wal_sender_timeout / 2). And since we don't call them
> > for every change, we would not need to worry about the overhead much.
> >
>
> But won't that lead to a call to GetCurrentTimestamp() for each change
> we skip? IIUC from previous replies that lead to a slight slowdown in
> previous tests of Wang-San.
>

If the above is true then I think we can use a lower skip_count say 10
along with a timeout mechanism to send keepalive message. This will
help us to alleviate the overhead Wang-San has shown.

BTW, I think there could be one other advantage of using
ProcessRepliesIfAny() (as you are suggesting) is that it can help to
release sync waiters if there are any. I feel that would be the case
for the skip_empty_transactions patch [1] which uses
WalSndUpdateProgress to send keepalive messages after skipping empty
transactions.

[1] - https://www.postgresql.org/message-id/CAFPTHDYvRSyT5ppYSPsH4Ozs0_W62-nffu0%3DmY1%2BsVipF%3DUN-g%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
Masahiko Sawada
Дата:
On Thu, Mar 17, 2022 at 7:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Mar 17, 2022 at 12:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Mar 16, 2022 at 7:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > After more thought, can we check only wal_sender_timeout without
> > > skip-count? That is, in WalSndUpdateProgress(), if we have received
> > > any reply from the subscriber in last (wal_sender_timeout / 2), we
> > > don't need to do anything in terms of keep-alive. If not, we do
> > > ProcessRepliesIfAny() (and probably WalSndCheckTimeOut()?) then
> > > WalSndKeepalivesIfNecessary(). That way, we can send keep-alive
> > > messages every (wal_sender_timeout / 2). And since we don't call them
> > > for every change, we would not need to worry about the overhead much.
> > >
> >
> > But won't that lead to a call to GetCurrentTimestamp() for each change
> > we skip? IIUC from previous replies that lead to a slight slowdown in
> > previous tests of Wang-San.
> >
> If the above is true then I think we can use a lower skip_count say 10
> along with a timeout mechanism to send keepalive message. This will
> help us to alleviate the overhead Wang-San has shown.

Using both sounds reasonable to me. I'd like to see how much the
overhead is alleviated by using skip_count 10 (or 100).

> BTW, I think there could be one other advantage of using
> ProcessRepliesIfAny() (as you are suggesting) is that it can help to
> release sync waiters if there are any. I feel that would be the case
> for the skip_empty_transactions patch [1] which uses
> WalSndUpdateProgress to send keepalive messages after skipping empty
> transactions.

+1

Regards,

[1]
https://www.postgresql.org/message-id/OS3PR01MB6275DFFDAC7A59FA148931529E209%40OS3PR01MB6275.jpnprd01.prod.outlook.com

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Thu, Mar 17, 2022 at 7:52 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
Thanks for your comments.

> On Thu, Mar 17, 2022 at 7:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Mar 17, 2022 at 12:27 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > On Wed, Mar 16, 2022 at 7:38 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > > >
> > > > After more thought, can we check only wal_sender_timeout without
> > > > skip-count? That is, in WalSndUpdateProgress(), if we have
> > > > received any reply from the subscriber in last (wal_sender_timeout
> > > > / 2), we don't need to do anything in terms of keep-alive. If not,
> > > > we do
> > > > ProcessRepliesIfAny() (and probably WalSndCheckTimeOut()?) then
> > > > WalSndKeepalivesIfNecessary(). That way, we can send keep-alive
> > > > messages every (wal_sender_timeout / 2). And since we don't call
> > > > them for every change, we would not need to worry about the overhead
> much.
> > > >
> > >
> > > But won't that lead to a call to GetCurrentTimestamp() for each
> > > change we skip? IIUC from previous replies that lead to a slight
> > > slowdown in previous tests of Wang-San.
> > >
> > If the above is true then I think we can use a lower skip_count say 10
> > along with a timeout mechanism to send keepalive message. This will
> > help us to alleviate the overhead Wang-San has shown.
> 
> Using both sounds reasonable to me. I'd like to see how much the overhead is
> alleviated by using skip_count 10 (or 100).
> 
> > BTW, I think there could be one other advantage of using
> > ProcessRepliesIfAny() (as you are suggesting) is that it can help to
> > release sync waiters if there are any. I feel that would be the case
> > for the skip_empty_transactions patch [1] which uses
> > WalSndUpdateProgress to send keepalive messages after skipping empty
> > transactions.
> 
> +1
I modified the patch according to your and Amit-San's suggestions.
In addition, after testing, I found that when the threshold is 10, it brings
slight overhead.
So I tried changing it to 100; after testing, the results look good to me.
10  : 1.22%--UpdateProgress
100 : 0.16%--UpdateProgress

Please refer to attachment.
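
The shape of the check in the attached version is roughly the following (a
simplified sketch; names may differ slightly from the real code in the patch):

```
#define CHANGES_THRESHOLD 100

static void
UpdateProgress(LogicalDecodingContext *ctx, bool skipped_change)
{
    static int  changes_count = 0;

    if (!skipped_change)
    {
        changes_count = 0;      /* something was sent, start counting again */
        return;
    }

    /*
     * Only every CHANGES_THRESHOLD skipped changes do we fall through to
     * OutputPluginUpdateProgress()/WalSndUpdateProgress(), where the
     * wal_sender_timeout / 2 time check (and GetCurrentTimestamp()) happens.
     * That keeps the clock call out of the per-change path.
     */
    if (++changes_count >= CHANGES_THRESHOLD)
    {
        OutputPluginUpdateProgress(ctx, true);
        changes_count = 0;
    }
}
```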

Attach the new patch.
1. Refactor the way to send keepalive messages.
   [suggestion by Sawada-San, Amit-San.]
2. Modify the value of flag is_send initialization to make it look more
   reasonable. [suggestion by Kuroda-San.]
3. Improve new function names.
   (From SendKeepaliveIfNecessary to UpdateProgress.)

Regards,
Wang wei

Вложения

RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Thu, Mar 9, 2022 at 11:52 AM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> Thank you for updating!
Thanks for your comments.

> 1. pgoutput_change
> ```
> +       bool is_send = true;
> ```
> 
> My first impression is that is_send should be initialized to false, and it will change
> to true when OutputPluginWrite() is called.
> 
> 
> 2. pgoutput_change
> ```
> +                               {
> +                                       is_send = false;
> +                                       break;
> +                               }
> ```
> 
> Here are too many indents, but I think they should be removed.
> See above comment.
Fixed. Initialize is_send to false.

> 3. WalSndUpdateProgress
> ```
> +               /*
> +                * If half of wal_sender_timeout has lapsed without send message
> standby,
> +                * send a keep-alive message to the standby.
> +                */
> ```
> 
> The comment seems inconsistency with others.
> Here is "keep-alive", but other parts are "keepalive".
Since this part of the code was refactored, this inconsistent comment was
removed.

> 4. ReorderBufferProcessTXN
> ```
> +
change-
> >data.inval.ninvalidations,
> +
> + change->data.inval.invalidations);
> ```
> 
> Maybe these lines break 80-columns rule.
Thanks for the reminder. I will run pgindent later.

Kindly have a look at new patch shared in [1].

[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275C67F14954E05CE5D04399E139%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Fri, Mar 18, 2022 at 10:43 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Thu, Mar 17, 2022 at 7:52 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
>
> Attach the new patch.
>

*
  case REORDER_BUFFER_CHANGE_INVALIDATION:
- /* Execute the invalidation messages locally */
- ReorderBufferExecuteInvalidations(
-   change->data.inval.ninvalidations,
-   change->data.inval.invalidations);
- break;
+ {
+ LogicalDecodingContext *ctx = rb->private_data;
+
+ /* Try to send a keepalive message. */
+ UpdateProgress(ctx, true);

Calling UpdateProgress() here appears adhoc to me especially because
it calls OutputPluginUpdateProgress which appears to be called only
from plugin API. Am, I missing something? Also why the same handling
is missed in other similar messages like
REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID where we don't call any
plug-in API?

I am not sure what is a good way to achieve this but one idea that
occurred to me was shall we invent a new callback
ReorderBufferSkipChangeCB similar to ReorderBufferApplyChangeCB and
then pgoutput can register its API where we can have the logic similar
to what you have in UpdateProgress()? If we do so, then all the
cuurent callers of UpdateProgress in pgoutput can also call that API.
What do you think?
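
To illustrate the callback idea (purely a sketch, nothing is worked out here),
it could mirror the existing apply-change callback:

```
/* Hypothetical callback, parallel to ReorderBufferApplyChangeCB. */
typedef void (*ReorderBufferSkipChangeCB) (ReorderBuffer *rb,
                                           ReorderBufferTXN *txn,
                                           Relation relation,
                                           ReorderBufferChange *change);

/*
 * ReorderBufferProcessTXN() would invoke rb->skip_change() for the changes
 * it does not hand to apply_change(), and pgoutput could register an
 * implementation containing the logic that is currently in UpdateProgress().
 */
```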

* Why don't you have a quick exit like below code in WalSndWriteData?
/* Try taking fast path unless we get too close to walsender timeout. */
if (now < TimestampTzPlusMilliseconds(last_reply_timestamp,
  wal_sender_timeout / 2) &&
!pq_is_send_pending())
{
return;
}

*  Can we rename variable 'is_send' to 'change_sent'?

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Fri, Mar 18, 2022 at 4:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Mar 18, 2022 at 10:43 AM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > On Thu, Mar 17, 2022 at 7:52 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> >
> > Attach the new patch.
> >
>
> *
>   case REORDER_BUFFER_CHANGE_INVALIDATION:
> - /* Execute the invalidation messages locally */
> - ReorderBufferExecuteInvalidations(
> -   change->data.inval.ninvalidations,
> -   change->data.inval.invalidations);
> - break;
> + {
> + LogicalDecodingContext *ctx = rb->private_data;
> +
> + /* Try to send a keepalive message. */
> + UpdateProgress(ctx, true);
>
> Calling UpdateProgress() here appears adhoc to me especially because
> it calls OutputPluginUpdateProgress which appears to be called only
> from plugin API. Am, I missing something? Also why the same handling
> is missed in other similar messages like
> REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID where we don't call any
> plug-in API?
>
> I am not sure what is a good way to achieve this but one idea that
> occurred to me was shall we invent a new callback
> ReorderBufferSkipChangeCB similar to ReorderBufferApplyChangeCB and
> then pgoutput can register its API where we can have the logic similar
> to what you have in UpdateProgress()? If we do so, then all the
> cuurent callers of UpdateProgress in pgoutput can also call that API.
> What do you think?
>

Another idea could be that we leave the DDL case for now as anyway
there is very less chance of timeout for skipping DDLs and we may
later need to even backpatch this bug-fix which would be another
reason to not make such invasive changes. We can handle the DDL case
if required separately.

-- 
With Regards,
Amit Kapila.



RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Mon, Mar 21, 2022 at 1:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
Thanks for your comments.

> On Fri, Mar 18, 2022 at 4:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Mar 18, 2022 at 10:43 AM wangw.fnst@fujitsu.com
> > <wangw.fnst@fujitsu.com> wrote:
> > >
> > > On Thu, Mar 17, 2022 at 7:52 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > > >
> > >
> > > Attach the new patch.
> > >
> >
> > *
> >   case REORDER_BUFFER_CHANGE_INVALIDATION:
> > - /* Execute the invalidation messages locally */
> > - ReorderBufferExecuteInvalidations(
> > -   change->data.inval.ninvalidations,
> > -   change->data.inval.invalidations);
> > - break;
> > + {
> > + LogicalDecodingContext *ctx = rb->private_data;
> > +
> > + /* Try to send a keepalive message. */
> > + UpdateProgress(ctx, true);
> >
> > Calling UpdateProgress() here appears adhoc to me especially because
> > it calls OutputPluginUpdateProgress which appears to be called only
> > from plugin API. Am, I missing something? Also why the same handling
> > is missed in other similar messages like
> > REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID where we don't call
> any
> > plug-in API?
Yes, you are right.
And I invoke it in case REORDER_BUFFER_CHANGE_INVALIDATION because I think every
DDL will modify the catalog and then get into this case. So I only invoke the
function UpdateProgress here to handle DDL.

> > I am not sure what is a good way to achieve this but one idea that
> > occurred to me was shall we invent a new callback
> > ReorderBufferSkipChangeCB similar to ReorderBufferApplyChangeCB and
> > then pgoutput can register its API where we can have the logic similar
> > to what you have in UpdateProgress()? If we do so, then all the
> > cuurent callers of UpdateProgress in pgoutput can also call that API.
> > What do you think?
> >
> Another idea could be that we leave the DDL case for now as anyway
> there is very less chance of timeout for skipping DDLs and we may
> later need to even backpatch this bug-fix which would be another
> reason to not make such invasive changes. We can handle the DDL case
> if required separately.
Yes, I think a new callback function would be nice.
Yes, as you said, maybe we could first fix the use case in which the problem was
found, and then make further modifications on the master branch.
I modified the patch; currently only the DML-related code remains.

> > * Why don't you have a quick exit like below code in WalSndWriteData?
> > /* Try taking fast path unless we get too close to walsender timeout. */ if (now
> > < TimestampTzPlusMilliseconds(last_reply_timestamp,
> >   wal_sender_timeout / 2) &&
> > !pq_is_send_pending())
> > {
> > return;
> > }
Fixed. I missed this, so I added it in the new patch.

> > *  Can we rename variable 'is_send' to 'change_sent'?
Improved the name of this variable (from 'is_send' to 'change_sent').

Attach the new patch. [suggestion by Amit-San.]
1. Remove DDL-related code. Handle the DDL case separately later if needed.
2. Add the missing quick-exit check (in function WalSndUpdateProgress).
3. Improve variable names (from 'is_send' to 'change_sent').
4. Fix some comments (above and inside the function WalSndUpdateProgress).

Regards,
Wang wei

Вложения

Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Tue, Mar 22, 2022 at 7:25 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> Attach the new patch.
>

It seems by mistake you have removed the changes from pgoutput_message
and pgoutput_truncate functions. I have added those back.
Additionally, I made a few other changes: (a) moved the function
UpdateProgress to pgoutput.c as it is not used outside it, (b) change
the new parameter in plugin API from 'send_keep_alive' to 'last_write'
to make it look similar to WalSndPrepareWrite and WalSndWriteData, (c)
made a number of changes in WalSndUpdateProgress API, it is better to
move keep-alive code after lag track code because we do process
replies at that time and there it will compute the lag; (d)
changed/added comments in the code.

Do let me know what you think of the attached?

-- 
With Regards,
Amit Kapila.

Вложения

RE: Logical replication timeout problem

От
"kuroda.hayato@fujitsu.com"
Дата:
Dear Amit,

> It seems by mistake you have removed the changes from pgoutput_message
> and pgoutput_truncate functions. I have added those back.
> Additionally, I made a few other changes: (a) moved the function
> UpdateProgress to pgoutput.c as it is not used outside it, (b) change
> the new parameter in plugin API from 'send_keep_alive' to 'last_write'
> to make it look similar to WalSndPrepareWrite and WalSndWriteData, (c)
> made a number of changes in WalSndUpdateProgress API, it is better to
> move keep-alive code after lag track code because we do process
> replies at that time and there it will compute the lag; (d)
> changed/added comments in the code.

LGTM, but the patch cannot be applied to current HEAD.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Thur, Mar 24, 2022 at 6:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
Thanks for your kind update.

> It seems by mistake you have removed the changes from pgoutput_message
> and pgoutput_truncate functions. I have added those back.
> Additionally, I made a few other changes: (a) moved the function
> UpdateProgress to pgoutput.c as it is not used outside it, (b) change
> the new parameter in plugin API from 'send_keep_alive' to 'last_write'
> to make it look similar to WalSndPrepareWrite and WalSndWriteData, (c)
> made a number of changes in WalSndUpdateProgress API, it is better to
> move keep-alive code after lag track code because we do process
> replies at that time and there it will compute the lag; (d)
> changed/added comments in the code.
> 
> Do let me know what you think of the attached?
It looks good to me. I just rebased it because of the change in the header (75b1521).
I tested it and the result looks good to me.

Attach the new patch.

Regards,
Wang wei

Вложения

Re: Logical replication timeout problem

От
Masahiko Sawada
Дата:
On Fri, Mar 25, 2022 at 2:23 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Thur, Mar 24, 2022 at 6:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> Thanks for your kindly update.
>
> > It seems by mistake you have removed the changes from pgoutput_message
> > and pgoutput_truncate functions. I have added those back.
> > Additionally, I made a few other changes: (a) moved the function
> > UpdateProgress to pgoutput.c as it is not used outside it, (b) change
> > the new parameter in plugin API from 'send_keep_alive' to 'last_write'
> > to make it look similar to WalSndPrepareWrite and WalSndWriteData, (c)
> > made a number of changes in WalSndUpdateProgress API, it is better to
> > move keep-alive code after lag track code because we do process
> > replies at that time and there it will compute the lag; (d)
> > changed/added comments in the code.
> >
> > Do let me know what you think of the attached?
> It looks good to me. Just rebase it because the change in header(75b1521).
> I tested it and the result looks good to me.

Since commit 75b1521 added decoding of sequence to logical
replication, the patch needs to have pgoutput_sequence() call
update_progress().

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Fri, Mar 25, 2022 at 11:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Mar 25, 2022 at 2:23 PM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
>
> Since commit 75b1521 added decoding of sequence to logical
> replication, the patch needs to have pgoutput_sequence() call
> update_progress().
>

Yeah, I also think this needs to be addressed. But apart from this, I
want to know your and other's opinion on the following two points:
a. Both this and the patch discussed in the nearby thread [1] add an
additional parameter to
WalSndUpdateProgress/OutputPluginUpdateProgress and it seems to me
that both are required. The additional parameter 'last_write' added by
this patch indicates: "If the last write is skipped then try (if we
are close to wal_sender_timeout) to send a keepalive message to the
receiver to avoid timeouts.". This means it can be used after any
'write' message. OTOH, the parameter 'skipped_xact' added by another
patch [1] indicates if we have skipped sending anything for a
transaction then sendkeepalive for synchronous replication to avoid
any delays in such a transaction. Does this sound reasonable or can
you think of a better way to deal with it?
b. Do we want to backpatch the patch in this thread? I am reluctant to
backpatch because it changes the exposed API which can have an impact
and second there exists a workaround (user can increase
wal_sender_timeout/wal_receiver_timeout).


[1] -
https://www.postgresql.org/message-id/OS0PR01MB5716BB24409D4B69206615B1941A9%40OS0PR01MB5716.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.



RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Fri, Mar 25, 2022 at 2:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Mar 25, 2022 at 2:23 PM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > On Thur, Mar 24, 2022 at 6:32 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > Thanks for your kindly update.
> >
> > > It seems by mistake you have removed the changes from
> pgoutput_message
> > > and pgoutput_truncate functions. I have added those back.
> > > Additionally, I made a few other changes: (a) moved the function
> > > UpdateProgress to pgoutput.c as it is not used outside it, (b) change
> > > the new parameter in plugin API from 'send_keep_alive' to 'last_write'
> > > to make it look similar to WalSndPrepareWrite and WalSndWriteData, (c)
> > > made a number of changes in WalSndUpdateProgress API, it is better to
> > > move keep-alive code after lag track code because we do process
> > > replies at that time and there it will compute the lag; (d)
> > > changed/added comments in the code.
> > >
> > > Do let me know what you think of the attached?
> > It looks good to me. Just rebase it because the change in header(75b1521).
> > I tested it and the result looks good to me.
> 
> Since commit 75b1521 added decoding of sequence to logical
> replication, the patch needs to have pgoutput_sequence() call
> update_progress().
Thanks for your comments.

Yes, you are right.
Add missing handling of pgoutput_sequence.

Attach the new patch.

Regards,
Wang wei

Вложения

RE: Logical replication timeout problem

От
"kuroda.hayato@fujitsu.com"
Дата:
Dear Wang-san,

Thank you for updating!
...but it also cannot be applied to the current HEAD
because of commit 923def9a533.

Your patch seems to conflict with the addition of an argument to logicalrep_write_insert().
That commit allows specifying which columns to publish by skipping some columns in logicalrep_write_tuple(),
which is called from logicalrep_write_insert() and logicalrep_write_update().

Do we have to consider some special case for that?
I thought a timeout may occur if users have a huge table and publish only a few columns,
but that is a corner case.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Mon, Mar 28, 2022 at 9:56 AM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> Dear Wang-san,
Thanks for your comments.

> Thank you for updating!
> ...but it also cannot be applied to current HEAD
> because of the commit 923def9a533.
> 
> Your patch seems to conflict the adding an argument of
> logicalrep_write_insert().
> It allows specifying columns to publish by skipping some columns in
> logicalrep_write_tuple()
> which is called from logicalrep_write_insert() and logicalrep_write_update().
Thanks for the kind reminder.
Rebase the patch.

> Do we have to consider something special case for that?
> I thought timeout may occur if users have huge table and publish few columns,
> but it is corner case.
I think we do not need to deal with this use case.
The maximum number of table columns allowed by PG is 1600
(macro MaxHeapAttributeNumber), and after looping through all columns in the
function logicalrep_write_tuple, the function OutputPluginWrite is invoked
immediately to actually send the data to the subscriber. This refreshes the
time at which the subscriber last received a message.
So I think this loop will not cause timeout issues.

Regards,
Wang wei

Вложения

Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Mon, Mar 28, 2022 at 11:41 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Mon, Mar 28, 2022 at 9:56 AM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
>
> > Do we have to consider something special case for that?
> > I thought timeout may occur if users have huge table and publish few columns,
> > but it is corner case.
> I think maybe we do not need to deal with this use case.
> The maximum number of table columns allowed by PG is 1600
> (macro MaxHeapAttributeNumber), and after loop through all columns in the
> function logicalrep_write_tuple, the function OutputPluginWrite will be invoked
> immediately to actually send the data to the subscriber. This refreshes the
> last time the subscriber received a message.
> So I think this loop will not cause timeout issues.
>

Right, I also don't think it can be a source of timeout.

--
With Regards,
Amit Kapila.



RE: Logical replication timeout problem

От
"kuroda.hayato@fujitsu.com"
Дата:
Dear Amit, Wang,

> > I think maybe we do not need to deal with this use case.
> > The maximum number of table columns allowed by PG is 1600
> > (macro MaxHeapAttributeNumber), and after loop through all columns in the
> > function logicalrep_write_tuple, the function OutputPluginWrite will be invoked
> > immediately to actually send the data to the subscriber. This refreshes the
> > last time the subscriber received a message.
> > So I think this loop will not cause timeout issues.
> >
> 
> Right, I also don't think it can be a source of timeout.

OK. I have no comments for this version.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED

RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Mon, Mar 28, 2022 at 2:11 AM I wrote:
> Rebase the patch.

After reviewing another patch [1], I think this patch should also add a loop in
the function WalSndUpdateProgress like what is done in the function WalSndWriteData.
So I have updated the patch to be consistent with the existing code and the patch
mentioned above.

Attach the new patch.

[1] -
https://www.postgresql.org/message-id/OS0PR01MB5716946347F607F4CFB02FCE941D9%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Regards,
Wang wei

Вложения

Re: Logical replication timeout problem

От
Masahiko Sawada
Дата:
On Fri, Mar 25, 2022 at 5:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Mar 25, 2022 at 11:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Mar 25, 2022 at 2:23 PM wangw.fnst@fujitsu.com
> > <wangw.fnst@fujitsu.com> wrote:
> >
> > Since commit 75b1521 added decoding of sequence to logical
> > replication, the patch needs to have pgoutput_sequence() call
> > update_progress().
> >
>
> Yeah, I also think this needs to be addressed. But apart from this, I
> want to know your and other's opinion on the following two points:
> a. Both this and the patch discussed in the nearby thread [1] add an
> additional parameter to
> WalSndUpdateProgress/OutputPluginUpdateProgress and it seems to me
> that both are required. The additional parameter 'last_write' added by
> this patch indicates: "If the last write is skipped then try (if we
> are close to wal_sender_timeout) to send a keepalive message to the
> receiver to avoid timeouts.". This means it can be used after any
> 'write' message. OTOH, the parameter 'skipped_xact' added by another
> patch [1] indicates if we have skipped sending anything for a
> transaction then sendkeepalive for synchronous replication to avoid
> any delays in such a transaction. Does this sound reasonable or can
> you think of a better way to deal with it?

These current approaches look good to me.

> b. Do we want to backpatch the patch in this thread? I am reluctant to
> backpatch because it changes the exposed API which can have an impact
> and second there exists a workaround (user can increase
> wal_sender_timeout/wal_receiver_timeout).

Yeah, we should avoid API changes between minor versions. I feel it's
better to fix it also for back-branches but probably we need another
fix for them. The issue reported on this thread seems quite
confusing; it looks like a network problem but it is not. Also, the
user who faced this issue has to increase wal_sender_timeout due to
the decoded data size, which also delays detecting real network
problems. It seems an unrelated trade-off.

Regards,
-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Tues, Mar 29, 2022 at 9:45 AM I wrote:
> Attach the new patch.

Rebase the patch because the commit d5a9d86d in current HEAD.

Regards,
Wang wei

Вложения

Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Wed, Mar 30, 2022 at 1:24 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Tues, Mar 29, 2022 at 9:45 AM I wrote:
> > Attach the new patch.
>
> Rebase the patch because the commit d5a9d86d in current HEAD.
>

Thanks, this looks good to me apart from a minor indentation change
which I'll take care of before committing. I am planning to push this
day after tomorrow on Friday unless there are any other major
comments.

-- 
With Regards,
Amit Kapila.



RE: Logical replication timeout problem

От
"shiy.fnst@fujitsu.com"
Дата:
On Wed, Mar 30, 2022 3:54 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
> 
> Rebase the patch because the commit d5a9d86d in current HEAD.
> 

Thanks for your patch. I tried it and confirmed that the timeout problem no longer
occurs after applying the patch, while I could still reproduce the problem on HEAD.

Regards,
Shi yu

Re: Logical replication timeout problem

От
Masahiko Sawada
Дата:
On Wed, Mar 30, 2022 at 6:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Mar 30, 2022 at 1:24 PM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > On Tues, Mar 29, 2022 at 9:45 AM I wrote:
> > > Attach the new patch.
> >
> > Rebase the patch because the commit d5a9d86d in current HEAD.
> >
>
> Thanks, this looks good to me apart from a minor indentation change
> which I'll take care of before committing. I am planning to push this
> day after tomorrow on Friday unless there are any other major
> comments.

The patch basically looks good to me. But my only concern is
that once we get the patch committed, we will have to call
update_progress() on all paths in callbacks that process changes,
which seems poor for maintainability.

On the other hand, possible another solution would be to add a new
callback that is called e.g., every 1000 changes so that walsender
does its job such as timeout handling while processing the decoded
data in reorderbuffer.c. The callback is set only if the walsender
does logical decoding, otherwise NULL. With this idea, other plugins
will also be able to benefit without changes. But I’m not really sure
it’s a good design, and adding a new callback introduces complexity.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Thu, Mar 31, 2022 at 5:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Mar 30, 2022 at 6:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Mar 30, 2022 at 1:24 PM wangw.fnst@fujitsu.com
> > <wangw.fnst@fujitsu.com> wrote:
> > >
> > > On Tues, Mar 29, 2022 at 9:45 AM I wrote:
> > > > Attach the new patch.
> > >
> > > Rebase the patch because the commit d5a9d86d in current HEAD.
> > >
> >
> > Thanks, this looks good to me apart from a minor indentation change
> > which I'll take care of before committing. I am planning to push this
> > day after tomorrow on Friday unless there are any other major
> > comments.
>
> The patch basically looks good to me. But the only concern to me is
> that once we get the patch committed, we will have to call
> update_progress() at all paths in callbacks that process changes.
> Which seems poor maintainability.
>
> On the other hand, possible another solution would be to add a new
> callback that is called e.g., every 1000 changes so that walsender
> does its job such as timeout handling while processing the decoded
> data in reorderbuffer.c. The callback is set only if the walsender
> does logical decoding, otherwise NULL. With this idea, other plugins
> will also be able to benefit without changes. But I’m not really sure
> it’s a good design, and adding a new callback introduces complexity.
>

Yeah, same here. I have also mentioned another way to expose an API
from reorderbuffer [1] by introducing a skip API but just not sure if
that or this API is generic enough to make it adding worth. Also, note
that the current patch makes the progress recording of large
transactions somewhat better when most of the changes are skipped. We
can further extend it to make it true for other cases as well but that
probably can be done separately if required as that is not required
for this bug-fix.

I intend to commit this patch today but I think it is better to wait
for a few more days to see if anybody has any opinion on this matter.
I'll push this on Tuesday unless we decide to do something different
here.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BfQjndoBOFUn9Wy0hhm3MLyUWEpcT9O7iuCELktfdBiQ%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
"Euler Taveira"
Дата:
On Thu, Mar 31, 2022, at 9:24 AM, Masahiko Sawada wrote:
The patch basically looks good to me. But the only concern to me is
that once we get the patch committed, we will have to call
update_progress() at all paths in callbacks that process changes.
Which seems poor maintainability.
I didn't like the current fix for the same reason. We need a robust feedback
system for logical replication. We had this discussion in the "skip empty
transactions" thread [1].

On the other hand, possible another solution would be to add a new
callback that is called e.g., every 1000 changes so that walsender
does its job such as timeout handling while processing the decoded
data in reorderbuffer.c. The callback is set only if the walsender
does logical decoding, otherwise NULL. With this idea, other plugins
will also be able to benefit without changes. But I’m not really sure
it’s a good design, and adding a new callback introduces complexity.
No new callback is required.

In the current code, each output plugin callback is responsible for calling
OutputPluginUpdateProgress. It is up to the output plugin author to add calls
to this function. The lack of a call in a callback might cause issues like what
was described in the initial message.

The functions CreateInitDecodingContext and CreateDecodingContext receive the
update_progress function as a parameter. These functions are called in 2
places: (a) streaming replication protocol (CREATE_REPLICATION_SLOT) and (b)
SQL logical decoding functions (pg_logical_*_changes). Case (a) uses
WalSndUpdateProgress as a progress function. Case (b) does not have one because
it is not required -- local decoding/communication. There is no custom update
progress routine for each output plugin which leads me to the question:
couldn't we encapsulate the update progress call into the callback functions?
If so, we could have an output plugin parameter to inform which callbacks we
would like to call the update progress routine. This would simplify the code,
make it less error prone and wouldn't impose a burden on maintainability.



--
Euler Taveira

Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Fri, Apr 1, 2022 at 7:33 AM Euler Taveira <euler@eulerto.com> wrote:
>
> On Thu, Mar 31, 2022, at 9:24 AM, Masahiko Sawada wrote:
>
> On the other hand, possible another solution would be to add a new
> callback that is called e.g., every 1000 changes so that walsender
> does its job such as timeout handling while processing the decoded
> data in reorderbuffer.c. The callback is set only if the walsender
> does logical decoding, otherwise NULL. With this idea, other plugins
> will also be able to benefit without changes. But I’m not really sure
> it’s a good design, and adding a new callback introduces complexity.
>
> No new callback is required.
>
> In the current code, each output plugin callback is responsible to call
> OutputPluginUpdateProgress. It is up to the output plugin author to add calls
> to this function. The lack of a call in a callback might cause issues like what
> was described in the initial message.
>

This is exactly our initial analysis and we have tried a patch on
these lines and it has a noticeable overhead. See [1]. Calling this
for each change or each skipped change can bring noticeable overhead,
which is why we decided to call it after a certain threshold (100) of
skipped changes. Now, surely as mentioned in my previous reply we can
make it generic such that instead of calling this (update_progress
function as in the patch) for skipped cases, we call it always. Will
that make it better?
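
To be concrete, the thresholding in the patch is essentially of the following
shape (an illustrative sketch only; the function and macro names here are made
up, and the extra flags added by the patches are omitted):

/*
 * Count consecutively skipped changes and pay the cost of a progress /
 * keepalive attempt only once per threshold.
 */
#define SKIPPED_CHANGES_THRESHOLD 100

static void
update_progress_for_skipped_change(LogicalDecodingContext *ctx)
{
    static int  skipped_changes_count = 0;

    if (++skipped_changes_count < SKIPPED_CHANGES_THRESHOLD)
        return;

    skipped_changes_count = 0;

    /* May send a keepalive if we are close to wal_sender_timeout. */
    OutputPluginUpdateProgress(ctx);
}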

> The functions CreateInitDecodingContext and CreateDecodingContext receives the
> update_progress function as a parameter. These functions are called in 2
> places: (a) streaming replication protocol (CREATE_REPLICATION_SLOT) and (b)
> SQL logical decoding functions (pg_logical_*_changes). Case (a) uses
> WalSndUpdateProgress as a progress function. Case (b) does not have one because
> it is not required -- local decoding/communication. There is no custom update
> progress routine for each output plugin which leads me to the question:
> couldn't we encapsulate the update progress call into the callback functions?
>

Sorry, I don't get your point. What exactly do you mean by this?
AFAIS, currently we call this output plugin API in pgoutput functions
only, do you intend to get it invoked from a different place?

[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275DFFDAC7A59FA148931529E209%40OS3PR01MB6275.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
"Euler Taveira"
Дата:
On Thu, Mar 31, 2022, at 11:27 PM, Amit Kapila wrote:
This is exactly our initial analysis and we have tried a patch on
these lines and it has a noticeable overhead. See [1]. Calling this
for each change or each skipped change can bring noticeable overhead
that is why we decided to call it after a certain threshold (100) of
skipped changes. Now, surely as mentioned in my previous reply we can
make it generic such that instead of calling this (update_progress
function as in the patch) for skipped cases, we call it always. Will
that make it better?
That's what I have in mind but using a different approach.

> The functions CreateInitDecodingContext and CreateDecodingContext receives the
> update_progress function as a parameter. These functions are called in 2
> places: (a) streaming replication protocol (CREATE_REPLICATION_SLOT) and (b)
> SQL logical decoding functions (pg_logical_*_changes). Case (a) uses
> WalSndUpdateProgress as a progress function. Case (b) does not have one because
> it is not required -- local decoding/communication. There is no custom update
> progress routine for each output plugin which leads me to the question:
> couldn't we encapsulate the update progress call into the callback functions?
>

Sorry, I don't get your point. What exactly do you mean by this?
AFAIS, currently we call this output plugin API in pgoutput functions
only, do you intend to get it invoked from a different place?
It seems I didn't make myself clear. The callbacks I'm referring to are the
*_cb_wrapper functions. After every ctx->callbacks.foo_cb() call inside a
*_cb_wrapper() function, we have something like:

if (ctx->progress & PGOUTPUT_PROGRESS_FOO)
    NewUpdateProgress(ctx, false);

The NewUpdateProgress function would contain a logic similar to the
update_progress() from the proposed patch. (A different function name here just
to avoid confusion.)

The output plugin is responsible to set ctx->progress with the callback
variables (for example, PGOUTPUT_PROGRESS_CHANGE for change_cb()) that we would
like to run NewUpdateProgress.
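
To illustrate, a change_cb_wrapper() in logical.c would then look roughly like
this (a simplified sketch; the flag name, ctx->progress, and NewUpdateProgress()
are the hypothetical names from above, and the existing error-context setup is
omitted):

static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
                  Relation relation, ReorderBufferChange *change)
{
    LogicalDecodingContext *ctx = cache->private_data;

    /* ... existing checks and error context setup ... */

    ctx->callbacks.change_cb(ctx, txn, relation, change);

    /* Run the progress/keepalive logic only if the plugin asked for it. */
    if (ctx->progress & PGOUTPUT_PROGRESS_CHANGE)
        NewUpdateProgress(ctx, false);
}

The output plugin would set ctx->progress once, e.g. in its startup callback,
to the OR of the callbacks for which it wants the progress logic to run.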


--
Euler Taveira

Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Fri, Apr 1, 2022 at 8:28 AM Euler Taveira <euler@eulerto.com> wrote:
>
> On Thu, Mar 31, 2022, at 11:27 PM, Amit Kapila wrote:
>
> This is exactly our initial analysis and we have tried a patch on
> these lines and it has a noticeable overhead. See [1]. Calling this
> for each change or each skipped change can bring noticeable overhead
> that is why we decided to call it after a certain threshold (100) of
> skipped changes. Now, surely as mentioned in my previous reply we can
> make it generic such that instead of calling this (update_progress
> function as in the patch) for skipped cases, we call it always. Will
> that make it better?
>
> That's what I have in mind but using a different approach.
>
> > The functions CreateInitDecodingContext and CreateDecodingContext receives the
> > update_progress function as a parameter. These functions are called in 2
> > places: (a) streaming replication protocol (CREATE_REPLICATION_SLOT) and (b)
> > SQL logical decoding functions (pg_logical_*_changes). Case (a) uses
> > WalSndUpdateProgress as a progress function. Case (b) does not have one because
> > it is not required -- local decoding/communication. There is no custom update
> > progress routine for each output plugin which leads me to the question:
> > couldn't we encapsulate the update progress call into the callback functions?
> >
>
> Sorry, I don't get your point. What exactly do you mean by this?
> AFAIS, currently we call this output plugin API in pgoutput functions
> only, do you intend to get it invoked from a different place?
>
> It seems I didn't make myself clear. The callbacks I'm referring to the
> *_cb_wrapper functions. After every ctx->callbacks.foo_cb() call into a
> *_cb_wrapper() function, we have something like:
>
> if (ctx->progress & PGOUTPUT_PROGRESS_FOO)
>     NewUpdateProgress(ctx, false);
>
> The NewUpdateProgress function would contain a logic similar to the
> update_progress() from the proposed patch. (A different function name here just
> to avoid confusion.)
>
> The output plugin is responsible to set ctx->progress with the callback
> variables (for example, PGOUTPUT_PROGRESS_CHANGE for change_cb()) that we would
> like to run NewUpdateProgress.
>

This sounds like a conflicting approach to what we currently do.
Currently, OutputPluginUpdateProgress() is called from the xact
related pgoutput functions like pgoutput_commit_txn(),
pgoutput_prepare_txn(), pgoutput_commit_prepared_txn(), etc. So, if we
follow what you are saying then for some of the APIs like
pgoutput_change/_message/_truncate, we need to set the parameter to
invoke NewUpdateProgress() which will internally call
OutputPluginUpdateProgress(), and for the remaining APIs, we will call
in the corresponding pgoutput_* function. I feel if we want to make it
more generic than the current patch, it is better to directly call
what you are referring to here as NewUpdateProgress() in all remaining
APIs like pgoutput_change/_truncate, etc.

-- 
With Regards,
Amit Kapila.



RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Fri, Apr 1, 2022 at 12:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Apr 1, 2022 at 8:28 AM Euler Taveira <euler@eulerto.com> wrote:
> >
> > On Thu, Mar 31, 2022, at 11:27 PM, Amit Kapila wrote:
> >
> > This is exactly our initial analysis and we have tried a patch on
> > these lines and it has a noticeable overhead. See [1]. Calling this
> > for each change or each skipped change can bring noticeable overhead
> > that is why we decided to call it after a certain threshold (100) of
> > skipped changes. Now, surely as mentioned in my previous reply we can
> > make it generic such that instead of calling this (update_progress
> > function as in the patch) for skipped cases, we call it always. Will
> > that make it better?
> >
> > That's what I have in mind but using a different approach.
> >
> > > The functions CreateInitDecodingContext and CreateDecodingContext
> receives the
> > > update_progress function as a parameter. These functions are called in 2
> > > places: (a) streaming replication protocol (CREATE_REPLICATION_SLOT) and
> (b)
> > > SQL logical decoding functions (pg_logical_*_changes). Case (a) uses
> > > WalSndUpdateProgress as a progress function. Case (b) does not have one
> because
> > > it is not required -- local decoding/communication. There is no custom
> update
> > > progress routine for each output plugin which leads me to the question:
> > > couldn't we encapsulate the update progress call into the callback functions?
> > >
> >
> > Sorry, I don't get your point. What exactly do you mean by this?
> > AFAIS, currently we call this output plugin API in pgoutput functions
> > only, do you intend to get it invoked from a different place?
> >
> > It seems I didn't make myself clear. The callbacks I'm referring to the
> > *_cb_wrapper functions. After every ctx->callbacks.foo_cb() call into a
> > *_cb_wrapper() function, we have something like:
> >
> > if (ctx->progress & PGOUTPUT_PROGRESS_FOO)
> >     NewUpdateProgress(ctx, false);
> >
> > The NewUpdateProgress function would contain a logic similar to the
> > update_progress() from the proposed patch. (A different function name here
> just
> > to avoid confusion.)
> >
> > The output plugin is responsible to set ctx->progress with the callback
> > variables (for example, PGOUTPUT_PROGRESS_CHANGE for change_cb())
> that we would
> > like to run NewUpdateProgress.
> >
> 
> This sounds like a conflicting approach to what we currently do.
> Currently, OutputPluginUpdateProgress() is called from the xact
> related pgoutput functions like pgoutput_commit_txn(),
> pgoutput_prepare_txn(), pgoutput_commit_prepared_txn(), etc. So, if we
> follow what you are saying then for some of the APIs like
> pgoutput_change/_message/_truncate, we need to set the parameter to
> invoke NewUpdateProgress() which will internally call
> OutputPluginUpdateProgress(), and for the remaining APIs, we will call
> in the corresponding pgoutput_* function. I feel if we want to make it
> more generic than the current patch, it is better to directly call
> what you are referring to here as NewUpdateProgress() in all remaining
> APIs like pgoutput_change/_truncate, etc.
Thanks for your comments.

According to your suggestion, improve the patch to make it more generic.
Attach the new patch.

Regards,
Wang wei

Вложения

Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Wed, Apr 6, 2022 at 11:09 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Fri, Apr 1, 2022 at 12:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Fri, Apr 1, 2022 at 8:28 AM Euler Taveira <euler@eulerto.com> wrote:
> > >
> > > It seems I didn't make myself clear. The callbacks I'm referring to the
> > > *_cb_wrapper functions. After every ctx->callbacks.foo_cb() call into a
> > > *_cb_wrapper() function, we have something like:
> > >
> > > if (ctx->progress & PGOUTPUT_PROGRESS_FOO)
> > >     NewUpdateProgress(ctx, false);
> > >
> > > The NewUpdateProgress function would contain a logic similar to the
> > > update_progress() from the proposed patch. (A different function name here
> > just
> > > to avoid confusion.)
> > >
> > > The output plugin is responsible to set ctx->progress with the callback
> > > variables (for example, PGOUTPUT_PROGRESS_CHANGE for change_cb())
> > that we would
> > > like to run NewUpdateProgress.
> > >
> >
> > This sounds like a conflicting approach to what we currently do.
> > Currently, OutputPluginUpdateProgress() is called from the xact
> > related pgoutput functions like pgoutput_commit_txn(),
> > pgoutput_prepare_txn(), pgoutput_commit_prepared_txn(), etc. So, if we
> > follow what you are saying then for some of the APIs like
> > pgoutput_change/_message/_truncate, we need to set the parameter to
> > invoke NewUpdateProgress() which will internally call
> > OutputPluginUpdateProgress(), and for the remaining APIs, we will call
> > in the corresponding pgoutput_* function. I feel if we want to make it
> > more generic than the current patch, it is better to directly call
> > what you are referring to here as NewUpdateProgress() in all remaining
> > APIs like pgoutput_change/_truncate, etc.
> Thanks for your comments.
>
> According to your suggestion, improve the patch to make it more generic.
> Attach the new patch.
>

 typedef void (*LogicalOutputPluginWriterUpdateProgress) (struct
LogicalDecodingContext *lr,
  XLogRecPtr Ptr,
  TransactionId xid,
- bool skipped_xact
+ bool skipped_xact,
+ bool last_write

In this approach, I don't think we need an additional parameter
last_write. Let's do the work related to keepalive without a
parameter, do you see any problem with that?

Also, let's try to evaluate how it impacts lag functionality for large
transactions?

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Wed, Apr 6, 2022 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Apr 6, 2022 at 11:09 AM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > According to your suggestion, improve the patch to make it more generic.
> > Attach the new patch.
> >
>
>  typedef void (*LogicalOutputPluginWriterUpdateProgress) (struct
> LogicalDecodingContext *lr,
>   XLogRecPtr Ptr,
>   TransactionId xid,
> - bool skipped_xact
> + bool skipped_xact,
> + bool last_write
>
> In this approach, I don't think we need an additional parameter
> last_write. Let's do the work related to keepalive without a
> parameter, do you see any problem with that?
>

I think this patch doesn't take into account that we call
OutputPluginUpdateProgress() from APIs like pgoutput_commit_txn(). I
think we should always call the new function update_progress from
those existing call sites and arrange the function such that when
called from xact end APIs like pgoutput_commit_txn(), it always calls
OutputPluginUpdateProgress and resets changes_count to 0.


-- 
With Regards,
Amit Kapila.



RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Wed, Apr 6, 2022 at 1:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Apr 6, 2022 at 4:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
Thanks for your comments.

>  typedef void (*LogicalOutputPluginWriterUpdateProgress) (struct
> LogicalDecodingContext *lr,
>   XLogRecPtr Ptr,
>   TransactionId xid,
> - bool skipped_xact
> + bool skipped_xact,
> + bool last_write
> 
> In this approach, I don't think we need an additional parameter last_write. Let's
> do the work related to keepalive without a parameter, do you see any problem
> with that?
I agree with you. Modify this point.

> I think this patch doesn't take into account that we call
> OutputPluginUpdateProgress() from APIs like pgoutput_commit_txn(). I
> think we should always call the new function update_progress from
> those existing call sites and arrange the function such that when
> called from xact end APIs like pgoutput_commit_txn(), it always call
> OutputPluginUpdateProgress and make changes_count as 0.
Improve it.
Add two new inputs to the function update_progress (skipped_xact and end_xact).
Modify the existing call sites to invoke update_progress instead of OutputPluginUpdateProgress.

> Also, let's try to evaluate how it impacts lag functionality for large transactions?
I think this patch will not affect the lag functionality. It will update the lag
fields of the pg_stat_replication view more frequently.
IIUC, when the function WalSndUpdateProgress is invoked, it stores the LSN of the
change and the invocation time in the lag tracker. Then when the function
ProcessStandbyReplyMessage is invoked, it calculates the lag fields according to the
message from the subscriber and the information in the lag tracker. This patch does
not modify this logic, but only increases the frequency of invocation.
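
For reference, the flow I am describing is roughly the following (a simplified
sketch based on walsender.c; the real code also rate-limits the sampling and
handles wrap-around in the lag tracker buffer):

/* Sender side: WalSndUpdateProgress() samples (LSN, local send time). */
LagTrackerWrite(lsn, GetCurrentTimestamp());

/*
 * When a standby reply arrives, ProcessStandbyReplyMessage() converts the
 * confirmed LSNs back into elapsed time using those samples; the results
 * become the write_lag/flush_lag/replay_lag columns of pg_stat_replication.
 */
writeLag = LagTrackerRead(LAG_TRACKER_WRITE_HEAD, writePtr, now);
flushLag = LagTrackerRead(LAG_TRACKER_FLUSH_HEAD, flushPtr, now);
applyLag = LagTrackerRead(LAG_TRACKER_APPLY_HEAD, applyPtr, now);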
Please let me know if my understanding is wrong.

Attach the new patch.
1. Remove the new function input parameters in this patch(parameter last_write
of WalSndUpdateProgress). [suggestion by Amit-San]
2. Also invoke function update_progress in other xact end APIs like
pgoutput_commit_txn. [suggestion by Amit-San]

Regards,
Wang wei

Вложения

Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Wed, Apr 6, 2022 at 6:30 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Wed, Apr 6, 2022 at 1:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Apr 6, 2022 at 4:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> Thanks for your comments.
>
> >  typedef void (*LogicalOutputPluginWriterUpdateProgress) (struct
> > LogicalDecodingContext *lr,
> >   XLogRecPtr Ptr,
> >   TransactionId xid,
> > - bool skipped_xact
> > + bool skipped_xact,
> > + bool last_write
> >
> > In this approach, I don't think we need an additional parameter last_write. Let's
> > do the work related to keepalive without a parameter, do you see any problem
> > with that?
> I agree with you. Modify this point.
>
> > I think this patch doesn't take into account that we call
> > OutputPluginUpdateProgress() from APIs like pgoutput_commit_txn(). I
> > think we should always call the new function update_progress from
> > those existing call sites and arrange the function such that when
> > called from xact end APIs like pgoutput_commit_txn(), it always call
> > OutputPluginUpdateProgress and make changes_count as 0.
> Improve it.
> Add two new input to function update_progress.(skipped_xact and end_xact).
> Modify the function invoke from OutputPluginUpdateProgress to update_progress.
>
> > Also, let's try to evaluate how it impacts lag functionality for large transactions?
> I think this patch will not affect lag functionality. It will updates the lag
> field of view pg_stat_replication more frequently.
> IIUC, when invoking function WalSndUpdateProgress, it will store the lsn of
> change and invoking time in lag_tracker. Then when invoking function
> ProcessStandbyReplyMessage, it will calculate the lag field according to the
> message from subscriber and the information in lag_tracker. This patch does
> not modify this logic, but only increases the frequency of invoking.
> Please let me know if I understand wrong.
>

No, your understanding seems correct to me. But what I want to check
is if calling the progress function more often has any impact on
lag-related fields in pg_stat_replication? I think you need to check
the impact of large transaction replay.

One comment:
+static void
+update_progress(LogicalDecodingContext *ctx, bool skipped_xact, bool end_xact)
+{
+ static int changes_count = 0;
+
+ if (end_xact)
+ {
+ /* Update progress tracking at xact end. */
+ OutputPluginUpdateProgress(ctx, skipped_xact);
+ changes_count = 0;
+ }
+ /*
+ * After continuously processing CHANGES_THRESHOLD changes, update progress
+ * which will also try to send a keepalive message if required.

I think you can simply return after making changes_count = 0. There
should be an empty line before starting the next comment.
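
i.e., roughly (the end_xact branch returns early, and a blank line separates it
from the threshold logic):

    if (end_xact)
    {
        /* Update progress tracking at xact end. */
        OutputPluginUpdateProgress(ctx, skipped_xact);
        changes_count = 0;
        return;
    }

    /* After continuously processing CHANGES_THRESHOLD changes, ... */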

-- 
With Regards,
Amit Kapila.



RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Wed, Apr 7, 2022 at 1:34 PM Amit Kapila <amit.kapila16@gmail.com>  wrote:
>
Thanks for your comments.

> One comment:
> +static void
> +update_progress(LogicalDecodingContext *ctx, bool skipped_xact, bool
> end_xact)
> +{
> + static int changes_count = 0;
> +
> + if (end_xact)
> + {
> + /* Update progress tracking at xact end. */
> + OutputPluginUpdateProgress(ctx, skipped_xact);
> + changes_count = 0;
> + }
> + /*
> + * After continuously processing CHANGES_THRESHOLD changes, update
> progress
> + * which will also try to send a keepalive message if required.
> 
> I think you can simply return after making changes_count = 0. There
> should be an empty line before starting the next comment.
Improve as suggested.
BTW, there is a conflict on current HEAD when applying v12 because of
commit 2c7ea57, so I also rebased it.

Attach the new patch.
1. Make some improvements to the new function update_progress. [suggestion by Amit-San]
2. Rebase the patch because the commit 2c7ea57 in current HEAD.

Regards,
Wang wei

Вложения

RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Wed, Apr 7, 2022 at 1:34 PM Amit Kapila <amit.kapila16@gmail.com>  wrote:
> On Wed, Apr 6, 2022 at 6:30 PM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > On Wed, Apr 6, 2022 at 1:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Wed, Apr 6, 2022 at 4:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > Also, let's try to evaluate how it impacts lag functionality for large
> transactions?
> > I think this patch will not affect lag functionality. It will updates the lag
> > field of view pg_stat_replication more frequently.
> > IIUC, when invoking function WalSndUpdateProgress, it will store the lsn of
> > change and invoking time in lag_tracker. Then when invoking function
> > ProcessStandbyReplyMessage, it will calculate the lag field according to the
> > message from subscriber and the information in lag_tracker. This patch does
> > not modify this logic, but only increases the frequency of invoking.
> > Please let me know if I understand wrong.
> >
> 
> No, your understanding seems correct to me. But what I want to check
> is if calling the progress function more often has any impact on
> lag-related fields in pg_stat_replication? I think you need to check
> the impact of large transaction replay.
Thanks for the explanation.

After doing some checks, I found that the v13 patch makes the calculations of
lag functionality inaccurate.

In short, the v13 patch makes us try to track lag more frequently and try to send a
keepalive message to subscribers. But in order to prevent flooding the lag
tracker, we cannot track lag more than once within
WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS (see function WalSndUpdateProgress).
This means we may lose information that needs to be tracked.
For example, suppose there is a large transaction with LSNs from lsn1 to lsn3.
In HEAD, when we calculate the lag time for lsn3, the lag time of lsn3 is
(now - lsn3.time).
But with the v13 patch, when we calculate the lag time for lsn3, there may be no
information for lsn3 but only information for lsn2 in the lag tracker, so the
lag time of lsn3 is (now - lsn2.time) (see function LagTrackerRead).
Therefore, if we lose the information that needs to be tracked, the lag time
becomes larger and inaccurate.

So I skip tracking lag during a transaction just like the current HEAD.
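
Concretely, the shape of the change is roughly the following (a sketch only,
using names from walsender.c; the keepalive handling added by this patch is
omitted and the details differ in the attached patch):

static void
WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn,
                     TransactionId xid, bool skipped_xact, bool end_xact)
{
    static TimestampTz sendTime = 0;
    TimestampTz now = GetCurrentTimestamp();

    /*
     * Track lag only for end-of-transaction LSNs (the only LSNs the
     * subscriber acknowledges), and no more than once per
     * WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS, as on HEAD.
     */
    if (end_xact &&
        TimestampDifferenceExceeds(sendTime, now,
                                   WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS))
    {
        LagTrackerWrite(lsn, now);
        sendTime = now;
    }

    /* ... keepalive handling for skipped changes goes here ... */
}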
Attach the new patch.

Regards,
Wang wei

Вложения

RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Mon, Apr 11, 2022 at 2:39 PM I wrote:
> Attach the new patch.
Also, I would like to share the test results and details.

To check that the LSN information used for the calculation is what we expected,
I gathered some information by adding logs in the function LagTrackerRead.

Summary of test results:
- On current HEAD and current HEAD with the v14 patch, we could find in lag_tracker
  the information for the same LSN as received from the subscriber side.
- On current HEAD with the v13 patch, we could hardly ever find the information for
  the same LSN in lag_tracker.

Attach the details:
[The log by HEAD]
the lsn we received from subscriber  |  the lsn whose time we used to calculate in lag_tracker
382826584                            |  382826584
743884840                            |  743884840
1104943232                           |  1104943232
1468949424                           |  1468949424
1469521216                           |  1469521216

[The log by HEAD with v14 patch]
the lsn we received from subscriber  |  the lsn whose time we used to calculate in lag_tracker
382826584                            |  382826584
743890672                            |  743890672
1105074264                           |  1105074264
1469127040                           |  1469127040
1830591240                           |  1830591240

[The log by HEAD with v13 patch]
the lsn we received from subscriber  |  the lsn whose time we used to calculate in lag_tracker
382826584                            |  359848728 
743884840                            |  713808560 
1105010640                           |  1073978544
1468517536                           |  1447850160
1469516328                           |  1469516328

Regards,
Wang wei

Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Mon, Apr 11, 2022 at 12:09 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> So I skip tracking lag during a transaction just like the current HEAD.
> Attach the new patch.
>

Thanks, please find the updated patch where I have slightly modified
the comments.

Sawada-San, Euler, do you have any opinion on this approach? I
personally still prefer the approach implemented in v10 [1] especially
due to the latest finding by Wang-San that we can't update the
lag-tracker apart from when it is invoked at the transaction end.
However, I am fine if we like this approach more.

[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275E0C2B4D9E488AD7CBA209E1F9%40OS3PR01MB6275.jpnprd01.prod.outlook.com
-- 
With Regards,
Amit Kapila.

Вложения

Re: Logical replication timeout problem

От
Masahiko Sawada
Дата:
On Wed, Apr 13, 2022 at 7:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Apr 11, 2022 at 12:09 PM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > So I skip tracking lag during a transaction just like the current HEAD.
> > Attach the new patch.
> >
>
> Thanks, please find the updated patch where I have slightly modified
> the comments.
>
> Sawada-San, Euler, do you have any opinion on this approach? I
> personally still prefer the approach implemented in v10 [1] especially
> due to the latest finding by Wang-San that we can't update the
> lag-tracker apart from when it is invoked at the transaction end.
> However, I am fine if we like this approach more.

Thank you for updating the patch.

The current patch looks much better than v10, which requires calling
update_progress() on every path.

Regarding v15 patch, I'm concerned a bit that the new function name,
update_progress(), is too generic. How about
update_replation_progress() or something more specific name?

---
+        if (end_xact)
+        {
+                /* Update progress tracking at xact end. */
+                OutputPluginUpdateProgress(ctx, skipped_xact, end_xact);
+                changes_count = 0;
+                return;
+        }
+
+        /*
+         * After continuously processing CHANGES_THRESHOLD changes, we try to
+         * send a keepalive message if required.
+         *
+         * We don't want to try sending a keepalive message after processing
+         * each change as that can have overhead. Testing reveals that there is
+         * no noticeable overhead in doing it after continuously processing 100
+         * or so changes.
+         */
+#define CHANGES_THRESHOLD 100
+        if (++changes_count >= CHANGES_THRESHOLD)
+        {
+                OutputPluginUpdateProgress(ctx, skipped_xact, end_xact);
+                changes_count = 0;
+        }

Can we merge two if branches since we do the same things? Or did you
separate them for better readability?
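
For example, something like this (just to illustrate the question):

    if (end_xact || ++changes_count >= CHANGES_THRESHOLD)
    {
        OutputPluginUpdateProgress(ctx, skipped_xact, end_xact);
        changes_count = 0;
    }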

Regards,


--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Logical replication timeout problem

От
"Euler Taveira"
Дата:
On Wed, Apr 13, 2022, at 7:45 AM, Amit Kapila wrote:
On Mon, Apr 11, 2022 at 12:09 PM wangw.fnst@fujitsu.com
>
> So I skip tracking lag during a transaction just like the current HEAD.
> Attach the new patch.
>

Thanks, please find the updated patch where I have slightly modified
the comments.

Sawada-San, Euler, do you have any opinion on this approach? I
personally still prefer the approach implemented in v10 [1] especially
due to the latest finding by Wang-San that we can't update the
lag-tracker apart from when it is invoked at the transaction end.
However, I am fine if we like this approach more.
It seems v15 is simpler and less error prone than v10. v10 has a mix of
OutputPluginUpdateProgress() and the new function update_progress(). v10
also calls update_progress() for every change action in pgoutput_change(). It
is not a good approach for maintainability -- new changes like sequences need
extra calls. However, as you mentioned, it should handle the lag tracking case.

Both patches change OutputPluginUpdateProgress() so it cannot be
backpatched. Are you planning to backpatch it? If so, the boolean variable
(last_write or end_xact, depending on which version you are considering) could
be added to LogicalDecodingContext. (You should probably consider this approach
for skipped_xact too.)

+ * For a large transaction, if we don't send any change to the downstream for a
+ * long time then it can timeout. This can happen when all or most of the
+ * changes are either not published or got filtered out.

We should probably mention that "long time" means wal_receiver_timeout on the
subscriber.

+    * change as that can have overhead. Testing reveals that there is no
+    * noticeable overhead in doing it after continuously processing 100 or so
+    * changes.

Tests revealed that ...

+    * We don't have a mechanism to get the ack for any LSN other than end xact
+    * lsn from the downstream. So, we track lag only for end xact lsn's.

s/lsn/LSN/ and s/lsn's/LSNs/

I would say "end of transaction LSN".

+ * If too many changes are processed then try to send a keepalive message to
+ * receiver to avoid timeouts.

In logical replication, if too many changes are processed then try to send a
keepalive message. It might avoid a timeout in the subscriber.

Does this same issue occur for long transactions? I mean keep a long
transaction open and execute thousands of transactions.

BEGIN;
INSERT INTO foo (a) VALUES(1);
-- wait a few hours while executing 10^x transactions
INSERT INTO foo (a) VALUES(2);
COMMIT;


--
Euler Taveira

Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Thu, Apr 14, 2022 at 5:52 PM Euler Taveira <euler@eulerto.com> wrote:
>
> On Wed, Apr 13, 2022, at 7:45 AM, Amit Kapila wrote:
>
> Sawada-San, Euler, do you have any opinion on this approach? I
> personally still prefer the approach implemented in v10 [1] especially
> due to the latest finding by Wang-San that we can't update the
> lag-tracker apart from when it is invoked at the transaction end.
> However, I am fine if we like this approach more.
>
> It seems v15 is simpler and less error prone than v10. v10 has a mix of
> OutputPluginUpdateProgress() and the new function update_progress(). The v10
> also calls update_progress() for every change action in pgoutput_change(). It
> is not a good approach for maintainability -- new changes like sequences need
> extra calls.
>

Okay, let's use the v15 approach as Sawada-San also seems to have a
preference for that.

> However, as you mentioned there should handle the track lag case.
>
> Both patches change the OutputPluginUpdateProgress() so it cannot be
> backpatched. Are you planning to backpatch it? If so, the boolean variable
> (last_write or end_xacts depending of which version you are considering) could
> be added to LogicalDecodingContext.
>

If we add it to LogicalDecodingContext then I think we have to always
reset the variable after its use which will make it look ugly and
error-prone. I was not thinking to backpatch it because of the API
change but I guess if we want to backpatch then we can add it to
LogicalDecodingContext for back-branches. I am not sure if that will
look committable but surely we can try.

> (You should probably consider this approach
> for skipped_xact too)
>

As mentioned, I think it will be more error-prone and we already have
other xact related parameters in that and similar APIs. So, I am not
sure why you want to prefer that?

>
> Does this same issue occur for long transactions? I mean keep a long
> transaction open and execute thousands of transactions.
>

No, this problem won't happen for such cases because we will only try
to send it at the commit time. Note that this problem happens only
when we don't send anything to the subscriber till a timeout happens.

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Thu, Apr 14, 2022 at 5:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Apr 13, 2022 at 7:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Apr 11, 2022 at 12:09 PM wangw.fnst@fujitsu.com
> > <wangw.fnst@fujitsu.com> wrote:
> > >
> > > So I skip tracking lag during a transaction just like the current HEAD.
> > > Attach the new patch.
> > >
> >
> > Thanks, please find the updated patch where I have slightly modified
> > the comments.
> >
> > Sawada-San, Euler, do you have any opinion on this approach? I
> > personally still prefer the approach implemented in v10 [1] especially
> > due to the latest finding by Wang-San that we can't update the
> > lag-tracker apart from when it is invoked at the transaction end.
> > However, I am fine if we like this approach more.
>
> Thank you for updating the patch.
>
> The current patch looks much better than v10 which requires to call to
> update_progress() every path.
>
> Regarding v15 patch, I'm concerned a bit that the new function name,
> update_progress(), is too generic. How about
> update_replation_progress() or something more specific name?
>

Do you intend to say update_replication_progress()? The word
'replation' doesn't make sense to me. I am fine with this suggestion.

>
> ---
> +        if (end_xact)
> +        {
> +                /* Update progress tracking at xact end. */
> +                OutputPluginUpdateProgress(ctx, skipped_xact, end_xact);
> +                changes_count = 0;
> +                return;
> +        }
> +
> +        /*
> +         * After continuously processing CHANGES_THRESHOLD changes, we try to
> +         * send a keepalive message if required.
> +         *
> +         * We don't want to try sending a keepalive message after processing
> +         * each change as that can have overhead. Testing reveals that there is
> +         * no noticeable overhead in doing it after continuously processing 100
> +         * or so changes.
> +         */
> +#define CHANGES_THRESHOLD 100
> +        if (++changes_count >= CHANGES_THRESHOLD)
> +        {
> +                OutputPluginUpdateProgress(ctx, skipped_xact, end_xact);
> +                changes_count = 0;
> +        }
>
> Can we merge two if branches since we do the same things? Or did you
> separate them for better readability?
>

I think it is fine to merge the two checks.

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
Masahiko Sawada
Дата:
On Mon, Apr 18, 2022 at 1:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Apr 14, 2022 at 5:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Apr 13, 2022 at 7:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Apr 11, 2022 at 12:09 PM wangw.fnst@fujitsu.com
> > > <wangw.fnst@fujitsu.com> wrote:
> > > >
> > > > So I skip tracking lag during a transaction just like the current HEAD.
> > > > Attach the new patch.
> > > >
> > >
> > > Thanks, please find the updated patch where I have slightly modified
> > > the comments.
> > >
> > > Sawada-San, Euler, do you have any opinion on this approach? I
> > > personally still prefer the approach implemented in v10 [1] especially
> > > due to the latest finding by Wang-San that we can't update the
> > > lag-tracker apart from when it is invoked at the transaction end.
> > > However, I am fine if we like this approach more.
> >
> > Thank you for updating the patch.
> >
> > The current patch looks much better than v10 which requires to call to
> > update_progress() every path.
> >
> > Regarding v15 patch, I'm concerned a bit that the new function name,
> > update_progress(), is too generic. How about
> > update_replation_progress() or something more specific name?
> >
>
> Do you intend to say update_replication_progress()? The word
> 'replation' doesn't make sense to me. I am fine with this suggestion.

Yeah, that was a typo. I meant update_replication_progress().

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Mon, Apr 18, 2022 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Apr 14, 2022 at 5:52 PM Euler Taveira <euler@eulerto.com> wrote:
> >
> > On Wed, Apr 13, 2022, at 7:45 AM, Amit Kapila wrote:
> >
> > Sawada-San, Euler, do you have any opinion on this approach? I
> > personally still prefer the approach implemented in v10 [1] especially
> > due to the latest finding by Wang-San that we can't update the
> > lag-tracker apart from when it is invoked at the transaction end.
> > However, I am fine if we like this approach more.
> >
> > It seems v15 is simpler and less error prone than v10. v10 has a mix of
> > OutputPluginUpdateProgress() and the new function update_progress(). The v10
> > also calls update_progress() for every change action in pgoutput_change(). It
> > is not a good approach for maintainability -- new changes like sequences need
> > extra calls.
> >
>
> Okay, let's use the v15 approach as Sawada-San also seems to have a
> preference for that.
>
> > However, as you mentioned there should handle the track lag case.
> >
> > Both patches change the OutputPluginUpdateProgress() so it cannot be
> > backpatched. Are you planning to backpatch it? If so, the boolean variable
> > (last_write or end_xacts depending of which version you are considering) could
> > be added to LogicalDecodingContext.
> >
>
> If we add it to LogicalDecodingContext then I think we have to always
> reset the variable after its use which will make it look ugly and
> error-prone. I was not thinking to backpatch it because of the API
> change but I guess if we want to backpatch then we can add it to
> LogicalDecodingContext for back-branches. I am not sure if that will
> look committable but surely we can try.
>

Even if we want to add the variable to the struct in back-branches,
we need to ensure we do not change the size of the struct as it is
exposed; see email [1] for a similar mistake we made in another case.

[1] - https://www.postgresql.org/message-id/2358496.1649168259%40sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.



RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Mon, Apr 18, 2022 at 00:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Apr 18, 2022 at 1:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Apr 14, 2022 at 5:50 PM Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
> > >
> > > On Wed, Apr 13, 2022 at 7:45 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > > >
> > > > On Mon, Apr 11, 2022 at 12:09 PM wangw.fnst@fujitsu.com
> > > > <wangw.fnst@fujitsu.com> wrote:
> > > > >
> > > > > So I skip tracking lag during a transaction just like the current HEAD.
> > > > > Attach the new patch.
> > > > >
> > > >
> > > > Thanks, please find the updated patch where I have slightly
> > > > modified the comments.
> > > >
> > > > Sawada-San, Euler, do you have any opinion on this approach? I
> > > > personally still prefer the approach implemented in v10 [1]
> > > > especially due to the latest finding by Wang-San that we can't
> > > > update the lag-tracker apart from when it is invoked at the transaction end.
> > > > However, I am fine if we like this approach more.
> > >
> > > Thank you for updating the patch.
> > >
> > > The current patch looks much better than v10 which requires to call
> > > to
> > > update_progress() every path.
> > >
> > > Regarding v15 patch, I'm concerned a bit that the new function name,
> > > update_progress(), is too generic. How about
> > > update_replation_progress() or something more specific name?
> > >
> >
> > Do you intend to say update_replication_progress()? The word
> > 'replation' doesn't make sense to me. I am fine with this suggestion.
> 
> Yeah, that was a typo. I meant update_replication_progress().
Thanks for your comments.

> > > Regarding v15 patch, I'm concerned a bit that the new function name,
> > > update_progress(), is too generic. How about
> > > update_replation_progress() or something more specific name?
Improve as suggested. Change the name from update_progress to
update_replication_progress.

> > > ---
> > > +        if (end_xact)
> > > +        {
> > > +                /* Update progress tracking at xact end. */
> > > +                OutputPluginUpdateProgress(ctx, skipped_xact, end_xact);
> > > +                changes_count = 0;
> > > +                return;
> > > +        }
> > > +
> > > +        /*
> > > +         * After continuously processing CHANGES_THRESHOLD changes, we try
> > > +         * to send a keepalive message if required.
> > > +         *
> > > +         * We don't want to try sending a keepalive message after processing
> > > +         * each change as that can have overhead. Testing reveals that there
> > > +         * is no noticeable overhead in doing it after continuously
> > > +         * processing 100 or so changes.
> > > +         */
> > > +#define CHANGES_THRESHOLD 100
> > > +        if (++changes_count >= CHANGES_THRESHOLD)
> > > +        {
> > > +                OutputPluginUpdateProgress(ctx, skipped_xact, end_xact);
> > > +                changes_count = 0;
> > > +        }
> > > 
> > > Can we merge two if branches since we do the same things? Or did you
> > > separate them for better readability?
Improve as suggested. Merge two if-branches.

Attach the new patch.
1. Rename the new function(update_progress) to update_replication_progress. [suggestion by Sawada-San]
2. Merge two if-branches in new function update_replication_progress. [suggestion by Sawada-San.]
3. Improve comments to make them clear. [suggestions by Euler-San.]

Regards,
Wang wei

Вложения

RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Thur, Apr 14, 2022 at 8:21 PM Euler Taveira <euler@eulerto.com> wrote:
>
Thanks for your comments.

> + * For a large transaction, if we don't send any change to the downstream for a
> + * long time then it can timeout. This can happen when all or most of the
> + * changes are either not published or got filtered out.
>
> We should probable mention that "long time" is wal_receiver_timeout on
> subscriber.
Improved as suggested.
Added "(exceeds the wal_receiver_timeout of standby)" to explain what "long time"
means.

> +    * change as that can have overhead. Testing reveals that there is no
> +    * noticeable overhead in doing it after continuously processing 100 or so
> +    * changes.
>
> Tests revealed that ...
Improved as suggested.

> +    * We don't have a mechanism to get the ack for any LSN other than end xact
> +    * lsn from the downstream. So, we track lag only for end xact lsn's.
>
> s/lsn/LSN/ and s/lsn's/LSNs/
>
> I would say "end of transaction LSN".
Improved as suggested.

> + * If too many changes are processed then try to send a keepalive message to
> + * receiver to avoid timeouts.
>
> In logical replication, if too many changes are processed then try to send a
> keepalive message. It might avoid a timeout in the subscriber.
Improved as suggested.

Kindly have a look at the new patch shared in [1].

[1] -
https://www.postgresql.org/message-id/OS3PR01MB627561344A2C7ECF68E41D7E9EF39%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei



Re: Logical replication timeout problem

От
Masahiko Sawada
Дата:
On Mon, Apr 18, 2022 at 3:16 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Mon, Apr 18, 2022 at 00:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Mon, Apr 18, 2022 at 1:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Apr 14, 2022 at 5:50 PM Masahiko Sawada <sawada.mshk@gmail.com>
> > wrote:
> > > >
> > > > On Wed, Apr 13, 2022 at 7:45 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > > > >
> > > > > On Mon, Apr 11, 2022 at 12:09 PM wangw.fnst@fujitsu.com
> > > > > <wangw.fnst@fujitsu.com> wrote:
> > > > > >
> > > > > > So I skip tracking lag during a transaction just like the current HEAD.
> > > > > > Attach the new patch.
> > > > > >
> > > > >
> > > > > Thanks, please find the updated patch where I have slightly
> > > > > modified the comments.
> > > > >
> > > > > Sawada-San, Euler, do you have any opinion on this approach? I
> > > > > personally still prefer the approach implemented in v10 [1]
> > > > > especially due to the latest finding by Wang-San that we can't
> > > > > update the lag-tracker apart from when it is invoked at the transaction end.
> > > > > However, I am fine if we like this approach more.
> > > >
> > > > Thank you for updating the patch.
> > > >
> > > > The current patch looks much better than v10 which requires to call
> > > > to
> > > > update_progress() every path.
> > > >
> > > > Regarding v15 patch, I'm concerned a bit that the new function name,
> > > > update_progress(), is too generic. How about
> > > > update_replation_progress() or something more specific name?
> > > >
> > >
> > > Do you intend to say update_replication_progress()? The word
> > > 'replation' doesn't make sense to me. I am fine with this suggestion.
> >
> > Yeah, that was a typo. I meant update_replication_progress().
> Thanks for your comments.
>
> > > > Regarding v15 patch, I'm concerned a bit that the new function name,
> > > > update_progress(), is too generic. How about
> > > > update_replation_progress() or something more specific name?
> Improve as suggested. Change the name from update_progress to
> update_replication_progress.
>
> > > > ---
> > > > +        if (end_xact)
> > > > +        {
> > > > +                /* Update progress tracking at xact end. */
> > > > +                OutputPluginUpdateProgress(ctx, skipped_xact, end_xact);
> > > > +                changes_count = 0;
> > > > +                return;
> > > > +        }
> > > > +
> > > > +        /*
> > > > +         * After continuously processing CHANGES_THRESHOLD changes,
> > > > we try to send
> > > > +         * a keepalive message if required.
> > > > +         *
> > > > +         * We don't want to try sending a keepalive message after
> > > > processing each
> > > > +         * change as that can have overhead. Testing reveals that there is no
> > > > +         * noticeable overhead in doing it after continuously
> > > > processing 100 or so
> > > > +         * changes.
> > > > +         */
> > > > +#define CHANGES_THRESHOLD 100
> > > > +        if (++changes_count >= CHANGES_THRESHOLD)
> > > > +        {
> > > > +                OutputPluginUpdateProgress(ctx, skipped_xact, end_xact);
> > > > +                changes_count = 0;
> > > > +        }
> > > >
> > > > Can we merge two if branches since we do the same things? Or did you
> > > > separate them for better readability?
> Improve as suggested. Merge two if-branches.
>
> Attach the new patch.
> 1. Rename the new function(update_progress) to update_replication_progress. [suggestion by Sawada-San]
> 2. Merge two if-branches in new function update_replication_progress. [suggestion by Sawada-San.]
> 3. Improve comments to make them clear. [suggestions by Euler-San.]

Thank you for updating the patch.

+ * For a large transaction, if we don't send any change to the downstream for a
+ * long time(exceeds the wal_receiver_timeout of standby) then it can timeout.
+ * This can happen when all or most of the changes are either not published or
+ * got filtered out.

+ */
+ if(end_xact || ++changes_count >= CHANGES_THRESHOLD)
+ {

We need a whitespace before '(' in the above two places. The rest looks good to me.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Mon, Apr 19, 2022 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for updating the patch.
Thanks for your comments.

> + * For a large transaction, if we don't send any change to the
> + downstream for a
> + * long time(exceeds the wal_receiver_timeout of standby) then it can
> timeout.
> + * This can happen when all or most of the changes are either not
> + published or
> + * got filtered out.
> 
> + */
> + if(end_xact || ++changes_count >= CHANGES_THRESHOLD) {
> 
> We need a whitespace before '(' at above two places. The rest looks good to me.
Fixed these.

Attached the new patch.
1. Fixed the formatting. [suggestion by Sawada-San]

Regards,
Wang wei

Вложения

RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Mon, Apr 18, 2022 at 00:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Apr 18, 2022 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Apr 14, 2022 at 5:52 PM Euler Taveira <euler@eulerto.com> wrote:
> > >
> > > On Wed, Apr 13, 2022, at 7:45 AM, Amit Kapila wrote:
> > >
> > > Sawada-San, Euler, do you have any opinion on this approach? I
> > > personally still prefer the approach implemented in v10 [1]
> > > especially due to the latest finding by Wang-San that we can't
> > > update the lag-tracker apart from when it is invoked at the transaction end.
> > > However, I am fine if we like this approach more.
> > >
> > > It seems v15 is simpler and less error prone than v10. v10 has a mix
> > > of
> > > OutputPluginUpdateProgress() and the new function update_progress().
> > > The v10 also calls update_progress() for every change action in
> > > pgoutput_change(). It is not a good approach for maintainability --
> > > new changes like sequences need extra calls.
> > >
> >
> > Okay, let's use the v15 approach as Sawada-San also seems to have a
> > preference for that.
> >
> > > However, as you mentioned there should handle the track lag case.
> > >
> > > Both patches change the OutputPluginUpdateProgress() so it cannot be
> > > backpatched. Are you planning to backpatch it? If so, the boolean
> > > variable (last_write or end_xacts depending of which version you are
> > > considering) could be added to LogicalDecodingContext.
> > >
> >
> > If we add it to LogicalDecodingContext then I think we have to always
> > reset the variable after its use which will make it look ugly and
> > error-prone. I was not thinking to backpatch it because of the API
> > change but I guess if we want to backpatch then we can add it to
> > LogicalDecodingContext for back-branches. I am not sure if that will
> > look committable but surely we can try.
> >
> 
> Even, if we want to add the variable in the struct in back-branches, we need to
> ensure not to change the size of the struct as it is exposed, see email [1] for a
> similar mistake we made in another case.
> 
> [1] - https://www.postgresql.org/message-
> id/2358496.1649168259%40sss.pgh.pa.us
Thanks for your comments.

I did some checks about adding the new variable to LogicalDecodingContext.
I found that because of padding, if we add the new variable at the end of the
structure, it does not change the structure size. I verified this in
REL_10~REL_14.

So, as suggested by Euler-San and Amit-San, I wrote the patch for REL_14 and
attached it. To avoid confusion, the patch for HEAD is also attached.
The patch for REL_14:
    REL_14_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch
The patch for HEAD:
    v17-0001-Fix-the-logical-replication-timeout-during-large.patch

The following are the details of the checks.
On gcc/Linux/x86-64, in REL_14, by looking at the size of each member variable
in the structure LogicalDecodingContext, I found that there are three runs of
padding after the following member variables:
- 7 bytes after fast_forward
- 4 bytes after prepared_write
- 4 bytes after write_xid

If we add the new variable at the end of the structure (a bool takes one byte),
we will only consume one byte of the padding after the member write_xid. And
then, at the end of the struct, 3 bytes of padding are still required. For easy
understanding, please refer to the following simple calculation.
(In REL14, the size of structure LogicalDecodingContext is 304 bytes.)
Before adding new variable (In REL14):
8+8+8+8+8+1+168+8+8+8+8+8+8+8+8+1+1+1+1+8+4  =  289 (if padding is not considered)
         +7                          +4  +4  =  +15 (the padding)
So, the size of structure LogicalDecodingContext is 289+15=304.
After adding new variable (In REL14 with patch):
8+8+8+8+8+1+168+8+8+8+8+8+8+8+8+1+1+1+1+8+4+1  =  290 (if padding is not considered)
         +7                          +4    +3  =  +14 (the padding)
So, the size of structure LogicalDecodingContext is 290+14=304.

BTW, the size of the structure LogicalDecodingContext in REL_10~REL_13 is 184, 200,
200, 200 respectively. And I found that at the end of the structure
LogicalDecodingContext, there are always the following members:
```
    XLogRecPtr  write_location;   --> 8
    TransactionId write_xid;      --> 4
                                  --> There are 4 padding after write_xid.
```
This means that at the end of the structure LogicalDecodingContext there are 4 bytes
of padding. So, if we add a new bool variable (it takes one byte) at the end
of the structure LogicalDecodingContext, I think in the current REL_10~REL_14,
because of padding, the size of the structure LogicalDecodingContext will not
change.
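
To make the check easy to reproduce, here is a small self-contained demo. It uses
my own stand-in structs (not the real LogicalDecodingContext) that only mirror the
tail members quoted above, just to show why a trailing bool can be absorbed by the
existing trailing padding on a typical 64-bit ABI:
```
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;        /* 8 bytes */
typedef uint32_t TransactionId;     /* 4 bytes */

/* Tail of the structure as it is today: 4 bytes of trailing padding. */
struct tail_before
{
    XLogRecPtr      write_location;
    TransactionId   write_xid;
};

/* Tail with the new flag: the bool consumes 1 byte of that padding. */
struct tail_after
{
    XLogRecPtr      write_location;
    TransactionId   write_xid;
    bool            end_xact;
};

int
main(void)
{
    /*
     * On common 64-bit ABIs both print 16.  On an ABI that aligns 8-byte
     * integers to 4 bytes there may be no trailing padding at all, so this
     * check has to be repeated per platform.
     */
    printf("%zu %zu\n", sizeof(struct tail_before), sizeof(struct tail_after));
    return 0;
}
```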

Regards,
Wang wei

Вложения

Re: Logical replication timeout problem

От
Masahiko Sawada
Дата:
On Wed, Apr 20, 2022 at 11:46 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Mon, Apr 18, 2022 at 00:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Mon, Apr 18, 2022 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Apr 14, 2022 at 5:52 PM Euler Taveira <euler@eulerto.com> wrote:
> > > >
> > > > On Wed, Apr 13, 2022, at 7:45 AM, Amit Kapila wrote:
> > > >
> > > > Sawada-San, Euler, do you have any opinion on this approach? I
> > > > personally still prefer the approach implemented in v10 [1]
> > > > especially due to the latest finding by Wang-San that we can't
> > > > update the lag-tracker apart from when it is invoked at the transaction end.
> > > > However, I am fine if we like this approach more.
> > > >
> > > > It seems v15 is simpler and less error prone than v10. v10 has a mix
> > > > of
> > > > OutputPluginUpdateProgress() and the new function update_progress().
> > > > The v10 also calls update_progress() for every change action in
> > > > pgoutput_change(). It is not a good approach for maintainability --
> > > > new changes like sequences need extra calls.
> > > >
> > >
> > > Okay, let's use the v15 approach as Sawada-San also seems to have a
> > > preference for that.
> > >
> > > > However, as you mentioned there should handle the track lag case.
> > > >
> > > > Both patches change the OutputPluginUpdateProgress() so it cannot be
> > > > backpatched. Are you planning to backpatch it? If so, the boolean
> > > > variable (last_write or end_xacts depending of which version you are
> > > > considering) could be added to LogicalDecodingContext.
> > > >
> > >
> > > If we add it to LogicalDecodingContext then I think we have to always
> > > reset the variable after its use which will make it look ugly and
> > > error-prone. I was not thinking to backpatch it because of the API
> > > change but I guess if we want to backpatch then we can add it to
> > > LogicalDecodingContext for back-branches. I am not sure if that will
> > > look committable but surely we can try.
> > >
> >
> > Even, if we want to add the variable in the struct in back-branches, we need to
> > ensure not to change the size of the struct as it is exposed, see email [1] for a
> > similar mistake we made in another case.
> >
> > [1] - https://www.postgresql.org/message-
> > id/2358496.1649168259%40sss.pgh.pa.us
> Thanks for your comments.
>
> I did some checks about adding the new variable in LogicalDecodingContext.
> I found that because of padding, if we add the new variable at the end of
> structure, it dose not make the structure size change. I verified this in
> REL_10~REL_14.
>
> So as suggested by Euler-San and Amit-San, I wrote the patch for REL_14. Attach
> this patch. To prevent patch confusion, the patch for HEAD is also attached.
> The patch for REL_14:
>     REL_14_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch
> The patch for HEAD:
>     v17-0001-Fix-the-logical-replication-timeout-during-large.patch
>
> The following is the details of checks.
> On gcc/Linux/x86-64, in REL_14, by looking at the size of each member variable
> in the structure LogicalDecodingContext, I found that there are three parts
> padding behind the following member variables:
> - 7 bytes after fast_forward
> - 4 bytes after prepared_write
> - 4 bytes after write_xid
>
> If we add the new variable at the end of structure (bool takes one byte), it
> means we will only consume one byte of padding after member write_xid. And
> then, at the end of the struct, 3 padding are still required. For easy
> understanding, please refer to the following simple calculation.
> (In REL14, the size of structure LogicalDecodingContext is 304 bytes.)
> Before adding new variable (In REL14):
> 8+8+8+8+8+1+168+8+8+8+8+8+8+8+8+1+1+1+1+8+4  =  ‭289 (if padding is not considered)
>          +7                          +4  +4  =  +15 (the padding)
> So, the size of structure LogicalDecodingContext is 289+15=304.
> After adding new variable (In REL14 with patch):
> 8+8+8+8+8+1+168+8+8+8+8+8+8+8+8+1+1+1+1+8+4+1  =  ‭290‬ (if padding is not considered)
>          +7                          +4    +3  =  +14 (the padding)
> So, the size of structure LogicalDecodingContext is 290+14=304.
>
> BTW, the size of structure LogicalDecodingContext in REL_10~REL_13 is 184, 200,
> 200,200 respectively. And I found that at the end of the structure
> LogicalDecodingContext, there are always the following members:
> ```
>     XLogRecPtr  write_location;   --> 8
>     TransactionId write_xid;      --> 4
>                                   --> There are 4 padding after write_xid.
> ```

I'm concerned that this 4-byte padding at the end of the struct could
depend on platforms (there might be no padding in 32-bit platforms?).
It seems to me that it's better to put it after fast_forward where the
new field should fall within the padding space.

BTW the changes in
REL_14_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch,
adding end_xact to LogicalDecodingContext, seem good to me and might be
better than the approach of the v17 patch from the plugin developers'
perspective? This is because they won't need to pass true/false to
end_xact of OutputPluginUpdateProgress(). Furthermore, if we do what
we do in update_replication_progress() in
OutputPluginUpdateProgress(), what plugins need to do will be just to
call OutputPluginUpdateProgress() in every callback and they won't need
to have the CHANGES_THRESHOLD logic. What do you think?
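
To illustrate the idea, here is a purely hypothetical sketch (it is not the current
pgoutput code; it assumes the threshold logic has been moved into
OutputPluginUpdateProgress(), that end_xact lives in LogicalDecodingContext, and it
keeps the two-argument signature used in that patch):
```
/* Hypothetical sketch of a change callback under the assumptions above. */
static void
pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                Relation relation, ReorderBufferChange *change)
{
    /* ... filter and send the change as today ... */

    /*
     * Just report progress unconditionally; the core code applies the
     * CHANGES_THRESHOLD/keepalive logic and knows from ctx->end_xact
     * whether this is the end of the transaction.
     */
    OutputPluginUpdateProgress(ctx, false);
}
```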

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Wed, Apr 20, 2022 at 12:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Apr 20, 2022 at 11:46 AM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> > ```
>
> I'm concerned that this 4-byte padding at the end of the struct could
> depend on platforms (there might be no padding in 32-bit platforms?).
>

Good point, but ...

> It seems to me that it's better to put it after fast_forward where the
> new field should fall within the padding space.
>

Can we add the variable in between the existing variables in the
structure in the back branches? Normally, we add at the end to avoid
any breakage of existing apps. See commit 56e366f675 and discussion at
[1]. That is related to enum but I think we follow the same for
structures.

[1] - https://www.postgresql.org/message-id/7dab0929-a966-0c0a-4726-878fced2fe00%40enterprisedb.com
-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Wed, Apr 20, 2022 at 2:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Apr 20, 2022 at 12:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Apr 20, 2022 at 11:46 AM wangw.fnst@fujitsu.com
> > <wangw.fnst@fujitsu.com> wrote:
> > > ```
> >
> > I'm concerned that this 4-byte padding at the end of the struct could
> > depend on platforms (there might be no padding in 32-bit platforms?).
> >
>
> Good point, but ...
>
> > It seems to me that it's better to put it after fast_forward where the
> > new field should fall within the padding space.
> >
>
> Can we add the variable in between the existing variables in the
> structure in the back branches?
>

I think it should be fine if it falls in the padding space. We have
done similar changes recently in back-branches [1]. I think it would
be then better to have it in the same place in HEAD as well?

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=10520f4346876aad4941797c2255a21bdac74739

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
Masahiko Sawada
Дата:
On Wed, Apr 20, 2022 at 7:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Apr 20, 2022 at 2:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Apr 20, 2022 at 12:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Apr 20, 2022 at 11:46 AM wangw.fnst@fujitsu.com
> > > <wangw.fnst@fujitsu.com> wrote:
> > > > ```
> > >
> > > I'm concerned that this 4-byte padding at the end of the struct could
> > > depend on platforms (there might be no padding in 32-bit platforms?).
> > >
> >
> > Good point, but ...
> >
> > > It seems to me that it's better to put it after fast_forward where the
> > > new field should fall within the padding space.
> > >
> >
> > Can we add the variable in between the existing variables in the
> > structure in the back branches?
> >
>
> I think it should be fine if it falls in the padding space. We have
> done similar changes recently in back-branches [1].

Yes.

> I think it would
> be then better to have it in the same place in HEAD as well?

As far as I can see in the v17 patch, which is for HEAD, we don't add
a variable to LogicalDecodingContext, but did you refer to another
patch?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Wed, Apr 20, 2022 at 6:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Apr 20, 2022 at 2:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Apr 20, 2022 at 12:51 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Apr 20, 2022 at 11:46 AM wangw.fnst@fujitsu.com
> > > <wangw.fnst@fujitsu.com> wrote:
> > > > ```
> > >
> > > I'm concerned that this 4-byte padding at the end of the struct could
> > > depend on platforms (there might be no padding in 32-bit platforms?).
> > >
> >
> > Good point, but ...
> >
> > > It seems to me that it's better to put it after fast_forward where the
> > > new field should fall within the padding space.
> > >
> >
> > Can we add the variable in between the existing variables in the
> > structure in the back branches?
> >
> 
> I think it should be fine if it falls in the padding space. We have
> done similar changes recently in back-branches [1]. I think it would
> be then better to have it in the same place in HEAD as well?
> 
> [1] -
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=10520f4346
> 876aad4941797c2255a21bdac74739
Thanks for your comments.

The comments by Sawada-San sound reasonable to me.
After checking, I found that the padding in HEAD is the same as in REL14.
So I changed the approach of the patch for HEAD to match the patch for REL14.

On Wed, Apr 20, 2022 at 3:21 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I'm concerned that this 4-byte padding at the end of the struct could
> depend on platforms (there might be no padding in 32-bit platforms?).
> It seems to me that it's better to put it after fast_forward where the
> new field should fall within the padding space.
Fixed. Added the new variable after fast_forward.

> BTW the changes in
> REL_14_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch,
> adding end_xact to LogicalDecodingContext, seems good to me and it
> might be better than the approach of v17 patch from plugin developers’
> perspective? This is because they won’t need to pass true/false to
> end_xact of  OutputPluginUpdateProgress(). Furthermore, if we do what
> we do in update_replication_progress() in
> OutputPluginUpdateProgress(), what plugins need to do will be just to
> call OutputPluginUpdate() in every callback and they don't need to
> have the CHANGES_THRESHOLD logic. What do you think?
Changed the approach of the patch for HEAD. (The size of the structure does not change.)
Also moved the logic of the function update_replication_progress into
OutputPluginUpdateProgress.

Attached the patches. [suggestions by Sawada-San]
1. Changed the position of the new variable in the structure.
2. Changed the approach of the patch for HEAD.
3. Deleted the new function update_replication_progress.

Regards,
Wang wei

Вложения

Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Wed, Apr 20, 2022 at 6:22 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Apr 20, 2022 at 7:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > I think it would
> > be then better to have it in the same place in HEAD as well?
>
> As far as I can see in the v17 patch, which is for HEAD, we don't add
> a variable to LogicalDecodingContext, but did you refer to another
> patch?
>

No, I thought it is better to follow the same approach in HEAD as
well. Do you see any problem with it?

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
Masahiko Sawada
Дата:
On Thu, Apr 21, 2022 at 11:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Apr 20, 2022 at 6:22 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Apr 20, 2022 at 7:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > I think it would
> > > be then better to have it in the same place in HEAD as well?
> >
> > As far as I can see in the v17 patch, which is for HEAD, we don't add
> > a variable to LogicalDecodingContext, but did you refer to another
> > patch?
> >
>
> No, I thought it is better to follow the same approach in HEAD as
> well. Do you see any problem with it?

No, that makes sense to me.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Wed, Apr 21, 2022 at 10:15 AM I wrote:
> The comments by Sawada-San sound reasonable to me.
> After doing check, I found that padding in HEAD is the same as in REL14.
> So I change the approach of patch for HEAD just like the patch for REL14.

Also attached the back-branch patches for REL10~REL13.
(The REL12 and REL11 patches are the same, so only one patch is posted for these two
branches.)

The patch for HEAD:
    HEAD_v18-0001-Fix-the-logical-replication-timeout-during-large.patch
The patch for REL14:
    REL14_v2-0001-Fix-the-logical-replication-timeout-during-large-.patch
The patch for REL13:
    REL13_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch
The patch for REL12 and REL11:
    REL12-REL11_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch
The patch for REL10:
    REL10_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch

BTW, after checking, I found that the padding in REL11~REL13 is similar to HEAD
and REL14 (7 bytes of padding after fast_forward). But in REL10, the padding is
different. There are three runs of padding after the following member variables:
- 4 bytes after options
- 6 bytes after prepared_write
- 4 bytes after write_xid
So, in the patches for branches REL11~HEAD, I added the new variable after
fast_forward. In the patch for branch REL10, I added the new variable after
prepared_write.
For each version, the size of the structure does not change after applying the
patch.

Regards,
Wang wei

Вложения

RE: Logical replication timeout problem

От
"houzj.fnst@fujitsu.com"
Дата:
On Wednesday, April 20, 2022 3:21 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> BTW the changes in
> REL_14_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch,
> adding end_xact to LogicalDecodingContext, seems good to me and it
> might be better than the approach of v17 patch from plugin developers’
> perspective? This is because they won’t need to pass true/false to
> end_xact of  OutputPluginUpdateProgress(). Furthermore, if we do what
> we do in update_replication_progress() in
> OutputPluginUpdateProgress(), what plugins need to do will be just to
> call OutputPluginUpdate() in every callback and they don't need to
> have the CHANGES_THRESHOLD logic. What do you think?

Hi Sawada-san, Wang

I was looking at the patch and noticed that we moved some logic from
update_replication_progress() to OutputPluginUpdateProgress(), as you
suggested.

I have a question about this change. In the patch we added some
restrictions to OutputPluginUpdateProgress(), like below:

+ /*
+ * If we are at the end of transaction LSN, update progress tracking.
+ * Otherwise, after continuously processing CHANGES_THRESHOLD changes, we
+ * try to send a keepalive message if required.
+ */
+ if (ctx->end_xact || ++changes_count >= CHANGES_THRESHOLD)
+ {
+ ctx->update_progress(ctx, ctx->write_location, ctx->write_xid,
+ skipped_xact);
+ changes_count = 0;
+ }

After the patch, we won't always be able to invoke update_progress() if the
caller is in the middle of a transaction (e.g. end_xact = false). The behavior of the
public function OutputPluginUpdateProgress() is changed. I am wondering whether it is
OK to change this in the back-branches?

Because OutputPluginUpdateProgress() is a public function for extension
developers, I am just a little concerned about the behavior change here.

Besides, the check of 'end_xact' and the CHANGES_THRESHOLD seem specific to
pgoutput. I am not sure whether plugin authors also need this
logic (they might want to change the strategy), so I'd like to confirm it with
you.

Best regards,
Hou zj


Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Thu, Apr 21, 2022 at 3:21 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>

I think it is better to keep the new variable 'end_xact' at the end of
the struct where it belongs for HEAD. In back branches, we can keep it
in the place you have it. Apart from that, I have made some cosmetic
changes and changed a few comments in the attached. Let's use this to
prepare patches for the back-branches.

-- 
With Regards,
Amit Kapila.

Вложения

RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Thu, Apr 28, 2022 at 6:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Apr 21, 2022 at 3:21 PM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> 
> I think it is better to keep the new variable 'end_xact' at the end of
> the struct where it belongs for HEAD. In back branches, we can keep it
> at the place as you have. Apart from that, I have made some cosmetic
> changes and changed a few comments in the attached. Let's use this to
> prepare patches for back-branches.
Thanks for your review and improvement.

I improved the back-branch patches according to your modifications.
Attached are the back-branch patches for REL10~REL14.
(The patch for HEAD is also attached; I did not make any changes to it.)

BTW, Hou-san shared some points. After our discussion, I will update
the patches if required.

Regards,
Wang wei

Вложения

Re: Logical replication timeout problem

От
Masahiko Sawada
Дата:
On Thu, Apr 28, 2022 at 7:01 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, April 20, 2022 3:21 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > BTW the changes in
> > REL_14_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch,
> > adding end_xact to LogicalDecodingContext, seems good to me and it
> > might be better than the approach of v17 patch from plugin developers’
> > perspective? This is because they won’t need to pass true/false to
> > end_xact of  OutputPluginUpdateProgress(). Furthermore, if we do what
> > we do in update_replication_progress() in
> > OutputPluginUpdateProgress(), what plugins need to do will be just to
> > call OutputPluginUpdate() in every callback and they don't need to
> > have the CHANGES_THRESHOLD logic. What do you think?
>
> Hi Sawada-san, Wang
>
> I was looking at the patch and noticed that we moved some logic from
> update_replication_progress() to OutputPluginUpdateProgress() like
> your suggestion.
>
> I have a question about this change. In the patch we added some
> restriction in this function OutputPluginUpdateProgress() like below:
>
> + /*
> + * If we are at the end of transaction LSN, update progress tracking.
> + * Otherwise, after continuously processing CHANGES_THRESHOLD changes, we
> + * try to send a keepalive message if required.
> + */
> + if (ctx->end_xact || ++changes_count >= CHANGES_THRESHOLD)
> + {
> + ctx->update_progress(ctx, ctx->write_location, ctx->write_xid,
> + skipped_xact);
> + changes_count = 0;
> + }
>
> After the patch, we won't be able to always invoke the update_progress() if the
> caller are at the middle of transaction(e.g. end_xact = false). The bebavior of the
> public function OutputPluginUpdateProgress() is changed. I am thinking is it ok to
> change this at back-branches ?
>
> Because OutputPluginUpdateProgress() is a public function for the extension
> developer, just a little concerned about the behavior change here.

Good point.

As you pointed out, it would not be good if we change the behavior of
OutputPluginUpdateProgress() in back branches. Also, after more
thought, it is not a good idea even for HEAD since there might be
background workers that use logical decoding for which the timeout
checking might not be relevant at all.

BTW, I think you're concerned about the plugins that call
OutputPluginUpdateProgress() in the middle of the transaction (i.e.,
end_xact = false). We have the following change that makes the
walsender not update the progress in the middle of the transaction. Do
you think it is okay, since it's not common usage to call
OutputPluginUpdateProgress() in the middle of the transaction by the
plugin that is used by the walsender?

 #define WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS 1000
-     if (!TimestampDifferenceExceeds(sendTime, now,
+     if (end_xact && TimestampDifferenceExceeds(sendTime, now,
      WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS))
-         return;
+     {
+         LagTrackerWrite(lsn, now);
+         sendTime = now;
+     }

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Mon, May 2, 2022 at 7:33 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Thu, Apr 28, 2022 at 7:01 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Hi Sawada-san, Wang
> >
> > I was looking at the patch and noticed that we moved some logic from
> > update_replication_progress() to OutputPluginUpdateProgress() like
> > your suggestion.
> >
> > I have a question about this change. In the patch we added some
> > restriction in this function OutputPluginUpdateProgress() like below:
> >
> > + /*
> > + * If we are at the end of transaction LSN, update progress tracking.
> > + * Otherwise, after continuously processing CHANGES_THRESHOLD changes, we
> > + * try to send a keepalive message if required.
> > + */
> > + if (ctx->end_xact || ++changes_count >= CHANGES_THRESHOLD)
> > + {
> > + ctx->update_progress(ctx, ctx->write_location, ctx->write_xid,
> > + skipped_xact);
> > + changes_count = 0;
> > + }
> >
> > After the patch, we won't be able to always invoke the update_progress() if the
> > caller are at the middle of transaction(e.g. end_xact = false). The bebavior of the
> > public function OutputPluginUpdateProgress() is changed. I am thinking is it ok to
> > change this at back-branches ?
> >
> > Because OutputPluginUpdateProgress() is a public function for the extension
> > developer, just a little concerned about the behavior change here.
>
> Good point.
>
> As you pointed out, it would not be good if we change the behavior of
> OutputPluginUpdateProgress() in back branches. Also, after more
> thought, it is not a good idea even for HEAD since there might be
> background workers that use logical decoding and the timeout checking
> might not be relevant at all with them.
>

So, shall we go back to the previous approach of using a separate
function update_replication_progress?

> BTW, I think you're concerned about the plugins that call
> OutputPluginUpdateProgress() at the middle of the transaction (i.e.,
> end_xact = false). We have the following change that makes the
> walsender not update the progress at the middle of the transaction. Do
> you think it is okay since it's not common usage to call
> OutputPluginUpdateProgress() at the middle of the transaction by the
> plugin that is used by the walsender?
>

We have done that purposefully as otherwise, the lag tracker shows
incorrect information. See email [1]. The reason is that we always get
ack from subscribers for transaction end. Also, prior to this patch we
never call the lag tracker recording apart from the transaction end,
so as a bug fix we shouldn't try to change it.

[1] -
https://www.postgresql.org/message-id/OS3PR01MB62755D216245199554DDC8DB9EEA9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
Masahiko Sawada
Дата:
On Mon, May 2, 2022 at 11:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 2, 2022 at 7:33 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Thu, Apr 28, 2022 at 7:01 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Hi Sawada-san, Wang
> > >
> > > I was looking at the patch and noticed that we moved some logic from
> > > update_replication_progress() to OutputPluginUpdateProgress() like
> > > your suggestion.
> > >
> > > I have a question about this change. In the patch we added some
> > > restriction in this function OutputPluginUpdateProgress() like below:
> > >
> > > + /*
> > > + * If we are at the end of transaction LSN, update progress tracking.
> > > + * Otherwise, after continuously processing CHANGES_THRESHOLD changes, we
> > > + * try to send a keepalive message if required.
> > > + */
> > > + if (ctx->end_xact || ++changes_count >= CHANGES_THRESHOLD)
> > > + {
> > > + ctx->update_progress(ctx, ctx->write_location, ctx->write_xid,
> > > + skipped_xact);
> > > + changes_count = 0;
> > > + }
> > >
> > > After the patch, we won't be able to always invoke the update_progress() if the
> > > caller are at the middle of transaction(e.g. end_xact = false). The bebavior of the
> > > public function OutputPluginUpdateProgress() is changed. I am thinking is it ok to
> > > change this at back-branches ?
> > >
> > > Because OutputPluginUpdateProgress() is a public function for the extension
> > > developer, just a little concerned about the behavior change here.
> >
> > Good point.
> >
> > As you pointed out, it would not be good if we change the behavior of
> > OutputPluginUpdateProgress() in back branches. Also, after more
> > thought, it is not a good idea even for HEAD since there might be
> > background workers that use logical decoding and the timeout checking
> > might not be relevant at all with them.
> >
>
> So, shall we go back to the previous approach of using a separate
> function update_replication_progress?

Ok, agreed.

>
> > BTW, I think you're concerned about the plugins that call
> > OutputPluginUpdateProgress() at the middle of the transaction (i.e.,
> > end_xact = false). We have the following change that makes the
> > walsender not update the progress at the middle of the transaction. Do
> > you think it is okay since it's not common usage to call
> > OutputPluginUpdateProgress() at the middle of the transaction by the
> > plugin that is used by the walsender?
> >
>
> We have done that purposefully as otherwise, the lag tracker shows
> incorrect information. See email [1]. The reason is that we always get
> ack from subscribers for transaction end. Also, prior to this patch we
> never call the lag tracker recording apart from the transaction end,
> so as a bug fix we shouldn't try to change it.

Make sense.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Mon, May 2, 2022 at 8:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, May 2, 2022 at 11:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > So, shall we go back to the previous approach of using a separate
> > function update_replication_progress?
>
> Ok, agreed.
>

Attached, please find the updated patch accordingly. Currently, I have
prepared it for HEAD, if you don't see any problem with this, we can
prepare the back-branch patches based on this.

-- 
With Regards,
Amit Kapila.

Вложения

Re: Logical replication timeout problem

От
Masahiko Sawada
Дата:
On Wed, May 4, 2022 at 7:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 2, 2022 at 8:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, May 2, 2022 at 11:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > So, shall we go back to the previous approach of using a separate
> > > function update_replication_progress?
> >
> > Ok, agreed.
> >
>
> Attached, please find the updated patch accordingly. Currently, I have
> prepared it for HEAD, if you don't see any problem with this, we can
> prepare the back-branch patches based on this.

Thank you for updating the patch. Looks good to me.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Fri, May 6, 2022 at 9:54 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, May 4, 2022 at 7:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, May 2, 2022 at 8:07 AM Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
> > >
> > > On Mon, May 2, 2022 at 11:32 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > > >
> > > >
> > > > So, shall we go back to the previous approach of using a separate
> > > > function update_replication_progress?
> > >
> > > Ok, agreed.
> > >
> >
> > Attached, please find the updated patch accordingly. Currently, I have
> > prepared it for HEAD, if you don't see any problem with this, we can
> > prepare the back-branch patches based on this.
> 
> Thank you for updating the patch. Looks good to me.
Thanks for your review.

Improved the back-branch patches according to the discussion.
Moved the CHANGES_THRESHOLD logic from the function OutputPluginUpdateProgress to the
new function update_replication_progress.
In addition, improved the formatting of all the patches with pgindent.

Attached the patches.

Regards,
Wang wei

Вложения

Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Fri, May 6, 2022 at 12:42 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Fri, May 6, 2022 at 9:54 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Wed, May 4, 2022 at 7:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, May 2, 2022 at 8:07 AM Masahiko Sawada <sawada.mshk@gmail.com>
> > wrote:
> > > >
> > > > On Mon, May 2, 2022 at 11:32 AM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > > > >
> > > > >
> > > > > So, shall we go back to the previous approach of using a separate
> > > > > function update_replication_progress?
> > > >
> > > > Ok, agreed.
> > > >
> > >
> > > Attached, please find the updated patch accordingly. Currently, I have
> > > prepared it for HEAD, if you don't see any problem with this, we can
> > > prepare the back-branch patches based on this.
> >
> > Thank you for updating the patch. Looks good to me.
> Thanks for your review.
>
> Improve the back-branch patches according to the discussion.
>

Thanks. The patch LGTM. I'll push and back-patch this after the
current minor release is done unless there are more comments related
to this work.

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
Masahiko Sawada
Дата:
On Mon, May 9, 2022 at 3:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 6, 2022 at 12:42 PM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > On Fri, May 6, 2022 at 9:54 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > On Wed, May 4, 2022 at 7:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Mon, May 2, 2022 at 8:07 AM Masahiko Sawada <sawada.mshk@gmail.com>
> > > wrote:
> > > > >
> > > > > On Mon, May 2, 2022 at 11:32 AM Amit Kapila <amit.kapila16@gmail.com>
> > > wrote:
> > > > > >
> > > > > >
> > > > > > So, shall we go back to the previous approach of using a separate
> > > > > > function update_replication_progress?
> > > > >
> > > > > Ok, agreed.
> > > > >
> > > >
> > > > Attached, please find the updated patch accordingly. Currently, I have
> > > > prepared it for HEAD, if you don't see any problem with this, we can
> > > > prepare the back-branch patches based on this.
> > >
> > > Thank you for updating the patch. Looks good to me.
> > Thanks for your review.
> >
> > Improve the back-branch patches according to the discussion.
> >
>

The patches look good to me too.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Logical replication timeout problem

От
"Euler Taveira"
Дата:
On Mon, May 9, 2022, at 3:47 AM, Amit Kapila wrote:
Thanks. The patch LGTM. I'll push and back-patch this after the
current minor release is done unless there are more comments related
to this work.
Looks sane to me. (I only tested the HEAD version)

+   bool        end_xact = ctx->end_xact;

Do you really need a new variable here? It has the same name and the new one
isn't changed during the execution.

Does this issue deserve a test? A small wal_receiver_timeout. Although, I'm not
sure how stable the test will be.


--
Euler Taveira

Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Mon, May 9, 2022 at 7:01 PM Euler Taveira <euler@eulerto.com> wrote:
>
> On Mon, May 9, 2022, at 3:47 AM, Amit Kapila wrote:
>
> Thanks. The patch LGTM. I'll push and back-patch this after the
> current minor release is done unless there are more comments related
> to this work.
>
> Looks sane to me. (I only tested the HEAD version)
>
> +   bool        end_xact = ctx->end_xact;
>
> Do you really need a new variable here? It has the same name and the new one
> isn't changed during the execution.
>

I think both ways should be okay. I thought the proposed way is okay
because it is used in more than one place and is probably slightly
easier to follow by having a separate variable.

> Does this issue deserve a test? A small wal_receiver_timeout. Although, I'm not
> sure how stable the test will be.
>

Yes, the main part is how to write a stable test because estimating
how many changes are enough for the configured wal_receiver_timeout to
pass on all the buildfarm machines is tricky. Also, I am not sure how
important it is to test this behavior because, by that reasoning, we
should have tests for all kinds of timeouts.

-- 
With Regards,
Amit Kapila.



RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Mon, May 9, 2022 at 11:23 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, May 9, 2022 at 7:01 PM Euler Taveira <euler@eulerto.com> wrote:
> >
> > On Mon, May 9, 2022, at 3:47 AM, Amit Kapila wrote:
> >
> > Thanks. The patch LGTM. I'll push and back-patch this after the
> > current minor release is done unless there are more comments related
> > to this work.
> > ......
> > Does this issue deserve a test? A small wal_receiver_timeout. Although, I'm
> not
> > sure how stable the test will be.
> >
> 
> Yes, the main part is how to write a stable test because estimating
> how many changes are enough for the configured wal_receiver_timeout to
> pass on all the buildfarm machines is tricky. Also, I am not sure how
> important is to test this behavior because based on this theory we
> should have tests for all kinds of timeouts.
Yes, I think we could not guarantee the stability of this test.
In addition, if we set wal_receiver_timeout too small, it may cause timeouts
unrelated to this bug. And if we set a bigger wal_receiver_timeout and use a larger
transaction in order to minimize the impact of machine performance, I think
this might take some time and might risk making the buildfarm slow.

Regards,
Wang wei

Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Mon, May 9, 2022 at 2:17 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> The patches look good to me too.
>

Pushed.

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
Fabrice Chapuis
Дата:
Hello Amit,

In version 14.4 the timeout problem for logical replication happens again despite the patch provided for this issue in this version. When bulky materialized views are reloaded, logical replication breaks. It is possible to solve this problem by using your new "streaming" option.
Have you ever had this issue reported to you?

Regards 

Fabrice

2022-10-10 17:19:02 CEST [538424]: [17-1] user=postgres,db=dbxxxa00,client=[local] CONTEXT:  SQL statement "REFRESH MATERIALIZED VIEW sxxxa00.table_base"
        PL/pgSQL function refresh_materialized_view(text) line 5 at EXECUTE
2022-10-10 17:19:02 CEST [538424]: [18-1] user=postgres,db=dbxxxa00,client=[local] STATEMENT:  select refresh_materialized_view('sxxxa00.table_base');
2022-10-10 17:19:02 CEST [538424]: [19-1] user=postgres,db=dbxxxa00,client=[local] LOG:  duration: 264815.652 ms  statement: select refresh_materialized_view('sxxxa00.table_base');
2022-10-10 17:19:27 CEST [559156]: [1-1] user=,db=,client= LOG:  automatic vacuum of table "dbxxxa00.sxxxa00.table_base": index scans: 0
        pages: 0 removed, 296589 remain, 0 skipped due to pins, 0 skipped frozen
        tuples: 0 removed, 48472622 remain, 0 are dead but not yet removable, oldest xmin: 1501528
        index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
        I/O timings: read: 1.494 ms, write: 0.000 ms
        avg read rate: 0.028 MB/s, avg write rate: 107.952 MB/s
        buffer usage: 593301 hits, 77 misses, 294605 dirtied
        WAL usage: 296644 records, 46119 full page images, 173652718 bytes
        system usage: CPU: user: 17.26 s, system: 0.29 s, elapsed: 21.32 s
2022-10-10 17:19:28 CEST [559156]: [2-1] user=,db=,client= LOG:  automatic analyze of table "dbxxxa00.sxxxa00.table_base"
        I/O timings: read: 0.043 ms, write: 0.000 ms
        avg read rate: 0.026 MB/s, avg write rate: 0.026 MB/s
        buffer usage: 30308 hits, 2 misses, 2 dirtied
        system usage: CPU: user: 0.54 s, system: 0.00 s, elapsed: 0.59 s
2022-10-10 17:19:34 CEST [3898111]: [6840-1] user=,db=,client= LOG:  checkpoint complete: wrote 1194 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=269.551 s, sync=0.002 s, total=269.560 s; sync files=251, longest=0.00
1 s, average=0.001 s; distance=583790 kB, estimate=583790 kB
2022-10-10 17:20:02 CEST [716163]: [2-1] user=,db=,client= ERROR:  terminating logical replication worker due to timeout
2022-10-10 17:20:02 CEST [3897921]: [13-1] user=,db=,client= LOG:  background worker "logical replication worker" (PID 716163) exited with exit code 1
2022-10-10 17:20:02 CEST [561346]: [1-1] user=,db=,client= LOG:  logical replication apply worker for subscription "subxxx_sxxxa00" has started

On Fri, Apr 1, 2022 at 6:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Apr 1, 2022 at 8:28 AM Euler Taveira <euler@eulerto.com> wrote:
>
> On Thu, Mar 31, 2022, at 11:27 PM, Amit Kapila wrote:
>
> This is exactly our initial analysis and we have tried a patch on
> these lines and it has a noticeable overhead. See [1]. Calling this
> for each change or each skipped change can bring noticeable overhead
> that is why we decided to call it after a certain threshold (100) of
> skipped changes. Now, surely as mentioned in my previous reply we can
> make it generic such that instead of calling this (update_progress
> function as in the patch) for skipped cases, we call it always. Will
> that make it better?
>
> That's what I have in mind but using a different approach.
>
> > The functions CreateInitDecodingContext and CreateDecodingContext receives the
> > update_progress function as a parameter. These functions are called in 2
> > places: (a) streaming replication protocol (CREATE_REPLICATION_SLOT) and (b)
> > SQL logical decoding functions (pg_logical_*_changes). Case (a) uses
> > WalSndUpdateProgress as a progress function. Case (b) does not have one because
> > it is not required -- local decoding/communication. There is no custom update
> > progress routine for each output plugin which leads me to the question:
> > couldn't we encapsulate the update progress call into the callback functions?
> >
>
> Sorry, I don't get your point. What exactly do you mean by this?
> AFAIS, currently we call this output plugin API in pgoutput functions
> only, do you intend to get it invoked from a different place?
>
> It seems I didn't make myself clear. The callbacks I'm referring to the
> *_cb_wrapper functions. After every ctx->callbacks.foo_cb() call into a
> *_cb_wrapper() function, we have something like:
>
> if (ctx->progress & PGOUTPUT_PROGRESS_FOO)
>     NewUpdateProgress(ctx, false);
>
> The NewUpdateProgress function would contain a logic similar to the
> update_progress() from the proposed patch. (A different function name here just
> to avoid confusion.)
>
> The output plugin is responsible to set ctx->progress with the callback
> variables (for example, PGOUTPUT_PROGRESS_CHANGE for change_cb()) that we would
> like to run NewUpdateProgress.
>

This sounds like a conflicting approach to what we currently do.
Currently, OutputPluginUpdateProgress() is called from the xact
related pgoutput functions like pgoutput_commit_txn(),
pgoutput_prepare_txn(), pgoutput_commit_prepared_txn(), etc. So, if we
follow what you are saying then for some of the APIs like
pgoutput_change/_message/_truncate, we need to set the parameter to
invoke NewUpdateProgress() which will internally call
OutputPluginUpdateProgress(), and for the remaining APIs, we will call
in the corresponding pgoutput_* function. I feel if we want to make it
more generic than the current patch, it is better to directly call
what you are referring to here as NewUpdateProgress() in all remaining
APIs like pgoutput_change/_truncate, etc.

--
With Regards,
Amit Kapila.

RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Tue, Oct 18, 2022 at 22:35 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
> Hello Amit,
>
> In version 14.4 the timeout problem for logical replication happens again despite
> the patch provided for this issue in this version. When bulky materialized views
> are reloaded it broke logical replication. It is possible to solve this problem by
> using your new "streaming" option.
> Have you ever had this issue reported to you?
>
> Regards
>
> Fabrice
>
> 2022-10-10 17:19:02 CEST [538424]: [17-1]
> user=postgres,db=dbxxxa00,client=[local] CONTEXT:  SQL statement "REFRESH
> MATERIALIZED VIEW sxxxa00.table_base"
>         PL/pgSQL function refresh_materialized_view(text) line 5 at EXECUTE
> 2022-10-10 17:19:02 CEST [538424]: [18-1]
> user=postgres,db=dbxxxa00,client=[local] STATEMENT:  select
> refresh_materialized_view('sxxxa00.table_base');
> 2022-10-10 17:19:02 CEST [538424]: [19-1]
> user=postgres,db=dbxxxa00,client=[local] LOG:  duration: 264815.652
> ms  statement: select refresh_materialized_view('sxxxa00.table_base');
> 2022-10-10 17:19:27 CEST [559156]: [1-1] user=,db=,client= LOG:  automatic
> vacuum of table "dbxxxa00.sxxxa00.table_base": index scans: 0
>         pages: 0 removed, 296589 remain, 0 skipped due to pins, 0 skipped frozen
>         tuples: 0 removed, 48472622 remain, 0 are dead but not yet removable,
> oldest xmin: 1501528
>         index scan not needed: 0 pages from table (0.00% of total) had 0 dead item
> identifiers removed
>         I/O timings: read: 1.494 ms, write: 0.000 ms
>         avg read rate: 0.028 MB/s, avg write rate: 107.952 MB/s
>         buffer usage: 593301 hits, 77 misses, 294605 dirtied
>         WAL usage: 296644 records, 46119 full page images, 173652718 bytes
>         system usage: CPU: user: 17.26 s, system: 0.29 s, elapsed: 21.32 s
> 2022-10-10 17:19:28 CEST [559156]: [2-1] user=,db=,client= LOG:  automatic
> analyze of table "dbxxxa00.sxxxa00.table_base"
>         I/O timings: read: 0.043 ms, write: 0.000 ms
>         avg read rate: 0.026 MB/s, avg write rate: 0.026 MB/s
>         buffer usage: 30308 hits, 2 misses, 2 dirtied
>         system usage: CPU: user: 0.54 s, system: 0.00 s, elapsed: 0.59 s
> 2022-10-10 17:19:34 CEST [3898111]: [6840-1] user=,db=,client= LOG:  checkpoint
> complete: wrote 1194 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
> write=269.551 s, sync=0.002 s, total=269.560 s; sync files=251, longest=0.00
> 1 s, average=0.001 s; distance=583790 kB, estimate=583790 kB
> 2022-10-10 17:20:02 CEST [716163]: [2-1] user=,db=,client= ERROR:  terminating
> logical replication worker due to timeout
> 2022-10-10 17:20:02 CEST [3897921]: [13-1] user=,db=,client= LOG:  background
> worker "logical replication worker" (PID 716163) exited with exit code 1
> 2022-10-10 17:20:02 CEST [561346]: [1-1] user=,db=,client= LOG:  logical
> replication apply worker for subscription "subxxx_sxxxa00" has started

Thanks for reporting!

There is one thing I want to confirm:
Is the statement `select refresh_materialized_view('sxxxa00.table_base');`
executed on the publisher-side?

If so, I think the reason for this timeout problem could be that during the DDL
(`REFRESH MATERIALIZED VIEW`), lots of temporary data is generated due to the
rewrite. Since this temporary data will not be processed by the pgoutput
plugin, our previous fix for DML had no impact on this case.
I think setting the "streaming" option to "on" could work around this problem.

I tried to write a draft patch (see attachment) on REL_14_4 to fix this.
I tried it locally and it seems to work.
Could you please confirm whether this problem is fixed after applying this
draft patch?

If this draft patch works, I will improve it and try to fix this problem.

Regards,
Wang wei

Вложения

Re: Logical replication timeout problem

От
Fabrice Chapuis
Дата:
Yes, the refresh of the MV is on the publisher side.
Thanks for your draft patch, I'll try it.
I'll get back to you as soon as possible.

One question: why is the refresh of the MV a DDL and not a DML?

Regards

Fabrice 

On Wed, 19 Oct 2022, 10:15 wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
On Tue, Oct 18, 2022 at 22:35 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
> Hello Amit,
>
> In version 14.4 the timeout problem for logical replication happens again despite
> the patch provided for this issue in this version. When bulky materialized views
> are reloaded it broke logical replication. It is possible to solve this problem by
> using your new "streaming" option.
> Have you ever had this issue reported to you?
>
> Regards
>
> Fabrice
>
> 2022-10-10 17:19:02 CEST [538424]: [17-1]
> user=postgres,db=dbxxxa00,client=[local] CONTEXT:  SQL statement "REFRESH
> MATERIALIZED VIEW sxxxa00.table_base"
>         PL/pgSQL function refresh_materialized_view(text) line 5 at EXECUTE
> 2022-10-10 17:19:02 CEST [538424]: [18-1]
> user=postgres,db=dbxxxa00,client=[local] STATEMENT:  select
> refresh_materialized_view('sxxxa00.table_base');
> 2022-10-10 17:19:02 CEST [538424]: [19-1]
> user=postgres,db=dbxxxa00,client=[local] LOG:  duration: 264815.652
> ms  statement: select refresh_materialized_view('sxxxa00.table_base');
> 2022-10-10 17:19:27 CEST [559156]: [1-1] user=,db=,client= LOG:  automatic
> vacuum of table "dbxxxa00.sxxxa00.table_base": index scans: 0
>         pages: 0 removed, 296589 remain, 0 skipped due to pins, 0 skipped frozen
>         tuples: 0 removed, 48472622 remain, 0 are dead but not yet removable,
> oldest xmin: 1501528
>         index scan not needed: 0 pages from table (0.00% of total) had 0 dead item
> identifiers removed
>         I/O timings: read: 1.494 ms, write: 0.000 ms
>         avg read rate: 0.028 MB/s, avg write rate: 107.952 MB/s
>         buffer usage: 593301 hits, 77 misses, 294605 dirtied
>         WAL usage: 296644 records, 46119 full page images, 173652718 bytes
>         system usage: CPU: user: 17.26 s, system: 0.29 s, elapsed: 21.32 s
> 2022-10-10 17:19:28 CEST [559156]: [2-1] user=,db=,client= LOG:  automatic
> analyze of table "dbxxxa00.sxxxa00.table_base"
>         I/O timings: read: 0.043 ms, write: 0.000 ms
>         avg read rate: 0.026 MB/s, avg write rate: 0.026 MB/s
>         buffer usage: 30308 hits, 2 misses, 2 dirtied
>         system usage: CPU: user: 0.54 s, system: 0.00 s, elapsed: 0.59 s
> 2022-10-10 17:19:34 CEST [3898111]: [6840-1] user=,db=,client= LOG:  checkpoint
> complete: wrote 1194 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
> write=269.551 s, sync=0.002 s, total=269.560 s; sync files=251, longest=0.00
> 1 s, average=0.001 s; distance=583790 kB, estimate=583790 kB
> 2022-10-10 17:20:02 CEST [716163]: [2-1] user=,db=,client= ERROR:  terminating
> logical replication worker due to timeout
> 2022-10-10 17:20:02 CEST [3897921]: [13-1] user=,db=,client= LOG:  background
> worker "logical replication worker" (PID 716163) exited with exit code 1
> 2022-10-10 17:20:02 CEST [561346]: [1-1] user=,db=,client= LOG:  logical
> replication apply worker for subscription "subxxx_sxxxa00" has started

Thanks for reporting!

There is one thing I want to confirm:
Is the statement `select refresh_materialized_view('sxxxa00.table_base');`
executed on the publisher-side?

If so, I think the reason for this timeout problem could be that during DDL
(`REFRESH MATERIALIZED VIEW`), lots of temporary data is generated due to
rewrite. Since these temporary data will not be processed by the pgoutput
plugin, our previous fix for DML had no impact on this case.
I think setting "streaming" option to "on" could work around this problem.

I tried to write a draft patch (see attachment) on REL_14_4 to fix this.
I tried it locally and it seems to work.
Could you please confirm whether this problem is fixed after applying this
draft patch?

If this draft patch works, I will improve it and try to fix this problem.

Regards,
Wang wei

RE: Logical replication timeout problem

From
"wangw.fnst@fujitsu.com"
Date:
On Thurs, Oct 20, 2022 at 13:47 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
> Yes the refresh of MV is on the Publisher Side.
> Thanks for your draft patch, I'll try it
> I'll back to you as soonas possible

Thanks a lot.

> One question: why the refresh of the MV is a DDL not a DML?

Since in the source, the type of command `REFRESH MATERIALIZED VIEW` is
`CMD_UTILITY`, I think this command is DDL (see CmdType in file nodes.h).

BTW, after searching for DML in the PG docs, I found the relevant
description at the link below:
https://www.postgresql.org/docs/devel/logical-replication-publication.html

Regards,
Wang wei

Re: Logical replication timeout problem

From
Fabrice Chapuis
Date:
Hello Wang,
I tested the draft patch in my lab on Postgres 14.4; the refresh of the materialized view ran without triggering the timeout on the worker.
Do you plan to propose this patch at the next commit fest?

Regards,
Fabrice

On Wed, Oct 19, 2022 at 10:15 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
On Tue, Oct 18, 2022 at 22:35 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
> Hello Amit,
>
> In version 14.4 the timeout problem for logical replication happens again despite
> the patch provided for this issue in this version. When bulky materialized views
> are reloaded it broke logical replication. It is possible to solve this problem by
> using your new "streaming" option.
> Have you ever had this issue reported to you?
>
> Regards
>
> Fabrice
>
> 2022-10-10 17:19:02 CEST [538424]: [17-1]
> user=postgres,db=dbxxxa00,client=[local] CONTEXT:  SQL statement "REFRESH
> MATERIALIZED VIEW sxxxa00.table_base"
>         PL/pgSQL function refresh_materialized_view(text) line 5 at EXECUTE
> 2022-10-10 17:19:02 CEST [538424]: [18-1]
> user=postgres,db=dbxxxa00,client=[local] STATEMENT:  select
> refresh_materialized_view('sxxxa00.table_base');
> 2022-10-10 17:19:02 CEST [538424]: [19-1]
> user=postgres,db=dbxxxa00,client=[local] LOG:  duration: 264815.652
> ms  statement: select refresh_materialized_view('sxxxa00.table_base');
> 2022-10-10 17:19:27 CEST [559156]: [1-1] user=,db=,client= LOG:  automatic
> vacuum of table "dbxxxa00.sxxxa00.table_base": index scans: 0
>         pages: 0 removed, 296589 remain, 0 skipped due to pins, 0 skipped frozen
>         tuples: 0 removed, 48472622 remain, 0 are dead but not yet removable,
> oldest xmin: 1501528
>         index scan not needed: 0 pages from table (0.00% of total) had 0 dead item
> identifiers removed
>         I/O timings: read: 1.494 ms, write: 0.000 ms
>         avg read rate: 0.028 MB/s, avg write rate: 107.952 MB/s
>         buffer usage: 593301 hits, 77 misses, 294605 dirtied
>         WAL usage: 296644 records, 46119 full page images, 173652718 bytes
>         system usage: CPU: user: 17.26 s, system: 0.29 s, elapsed: 21.32 s
> 2022-10-10 17:19:28 CEST [559156]: [2-1] user=,db=,client= LOG:  automatic
> analyze of table "dbxxxa00.sxxxa00.table_base"
>         I/O timings: read: 0.043 ms, write: 0.000 ms
>         avg read rate: 0.026 MB/s, avg write rate: 0.026 MB/s
>         buffer usage: 30308 hits, 2 misses, 2 dirtied
>         system usage: CPU: user: 0.54 s, system: 0.00 s, elapsed: 0.59 s
> 2022-10-10 17:19:34 CEST [3898111]: [6840-1] user=,db=,client= LOG:  checkpoint
> complete: wrote 1194 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
> write=269.551 s, sync=0.002 s, total=269.560 s; sync files=251, longest=0.00
> 1 s, average=0.001 s; distance=583790 kB, estimate=583790 kB
> 2022-10-10 17:20:02 CEST [716163]: [2-1] user=,db=,client= ERROR:  terminating
> logical replication worker due to timeout
> 2022-10-10 17:20:02 CEST [3897921]: [13-1] user=,db=,client= LOG:  background
> worker "logical replication worker" (PID 716163) exited with exit code 1
> 2022-10-10 17:20:02 CEST [561346]: [1-1] user=,db=,client= LOG:  logical
> replication apply worker for subscription "subxxx_sxxxa00" has started

Thanks for reporting!

There is one thing I want to confirm:
Is the statement `select refresh_materialized_view('sxxxa00.table_base');`
executed on the publisher-side?

If so, I think the reason for this timeout problem could be that during DDL
(`REFRESH MATERIALIZED VIEW`), lots of temporary data is generated due to
rewrite. Since these temporary data will not be processed by the pgoutput
plugin, our previous fix for DML had no impact on this case.
I think setting "streaming" option to "on" could work around this problem.

I tried to write a draft patch (see attachment) on REL_14_4 to fix this.
I tried it locally and it seems to work.
Could you please confirm whether this problem is fixed after applying this
draft patch?

If this draft patch works, I will improve it and try to fix this problem.

Regards,
Wang wei

RE: Logical replication timeout problem

From
"wangw.fnst@fujitsu.com"
Date:
On Fri, Nov 4, 2022 at 18:13 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
> Hello Wang,
> 
> I tested the draft patch in my lab for Postgres 14.4, the refresh of the
> materialized view ran without generating the timeout on the worker.
> Do you plan to propose this patch at the next commit fest.

Thanks for your confirmation!
I will add this thread to the commit fest soon.

Here is the problem analysis and the fix approach:
I think the problem is that when a DDL in a transaction generates lots of
temporary data due to rewrite rules, this temporary data will not be
processed by the pgoutput plugin. Therefore, the previous fix (f95d53e) for
DML had no impact on this case.

To fix this, I think we need to try to send the keepalive messages after each
change is processed by the walsender, not in the pgoutput plugin.
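To make that concrete, here is a condensed sketch of what the attached patch
does (the attached patch is authoritative; this is only an illustration and
omits a few details):

static inline void
ReorderBufferUpdateProgress(ReorderBuffer *rb, ReorderBufferTXN *txn,
                            ReorderBufferChange *change)
{
    LogicalDecodingContext *ctx = rb->private_data;
    static int  changes_count = 0;

    if (!ctx->update_progress)
        return;

    /* set output state for the progress update */
    ctx->write_xid = txn->xid;
    ctx->write_location = change->lsn;

    /* Give the walsender a chance to send a keepalive every ~100 changes. */
    if (++changes_count >= 100)
    {
        ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false);
        changes_count = 0;
    }
}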

Attach the patch.

Regards,
Wang wei

Attachments

Re: Logical replication timeout problem

From
Ashutosh Bapat
Date:
Hi Wang,
Thanks for working on this. One of our customers faced a similar
situation when running BDR with PostgreSQL.

I tested your patch and it solves the problem.

Please find some review comments below

On Tue, Nov 8, 2022 at 8:34 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
>
> Attach the patch.
>

+/*
+ * Helper function for ReorderBufferProcessTXN for updating progress.
+ */
+static inline void
+ReorderBufferUpdateProgress(ReorderBuffer *rb, ReorderBufferTXN *txn,
+                            ReorderBufferChange *change)
+{
+    LogicalDecodingContext *ctx = rb->private_data;
+    static int    changes_count = 0;

It's not easy to know that a variable is static when reading the code that
uses it, so it's easy to misread the code. I would probably track it
through the logical decoding context itself OR through a global variable, like
other places where we track the last timestamps. But there's more on this below.

+
+    if (!ctx->update_progress)
+        return;
+
+    Assert(!ctx->fast_forward);
+
+    /* set output state */
+    ctx->accept_writes = false;
+    ctx->write_xid = txn->xid;
+    ctx->write_location = change->lsn;
+    ctx->end_xact = false;

This patch reverts many of the changes of the previous commit which tried to
fix this issue, i.e. 55558df2374. end_xact was introduced by the same commit,
but without much explanation in the commit message. Its only user,
WalSndUpdateProgress(), is probably making a wrong assumption as well.

     * We don't have a mechanism to get the ack for any LSN other than end
     * xact LSN from the downstream. So, we track lag only for end of
     * transaction LSN.

IIUC, the WAL sender tracks the LSN of the last WAL record read in sentPtr,
which is sent downstream through a keepalive message. The downstream may
acknowledge this LSN. So we do get an ack for any LSN, not just the commit LSN.

So I propose removing end_xact as well.

+
+    /*
+     * We don't want to try sending a keepalive message after processing each
+     * change as that can have overhead. Tests revealed that there is no
+     * noticeable overhead in doing it after continuously processing 100 or so
+     * changes.
+     */
+#define CHANGES_THRESHOLD 100

I think a time-based threshold makes more sense. What if the timeout was
nearing and those 100 changes just took a little more time, causing a timeout?
We already have a time-based threshold in WalSndKeepaliveIfNecessary(). And
that function is invoked after reading every WAL record in WalSndLoop(), so it
does not look like it's an expensive function. If it is expensive, we might
want to worry about WalSndLoop as well. Does it make more sense to remove this
threshold?
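In other words, something like the below (just a sketch of the idea; it relies
on the existing time-based throttling in the walsender instead of a change
counter):

    /*
     * No CHANGES_THRESHOLD here; the walsender's own time-based checks
     * (e.g. WalSndKeepaliveIfNecessary) decide whether anything is sent.
     */
    if (ctx->update_progress)
        ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false);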

+
+    /*
+     * After continuously processing CHANGES_THRESHOLD changes, we
+     * try to send a keepalive message if required.
+     */
+    if (++changes_count >= CHANGES_THRESHOLD)
+    {
+        ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false);
+        changes_count = 0;
+    }
+}
+

On the other thread, I mentioned that we don't have a TAP test for it. I agree
with Amit's opinion there that it's hard to create a test which will time out
everywhere. I think what we need is a way to control the time required for
decoding a transaction.

A rough idea is to induce a small sleep after decoding every change. The amount
of sleep * number of changes will help us estimate and control the amount of
time taken to decode a transaction. Then we create a transaction which will
take longer than the timeout threshold to decode. But that's significant code.
I don't think PostgreSQL has a facility to induce a delay at a particular place
in the code.

-- 
Best Wishes,
Ashutosh Bapat



Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Fri, Jan 6, 2023 at 12:35 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> +
> +    /*
> +     * We don't want to try sending a keepalive message after processing each
> +     * change as that can have overhead. Tests revealed that there is no
> +     * noticeable overhead in doing it after continuously processing 100 or so
> +     * changes.
> +     */
> +#define CHANGES_THRESHOLD 100
>
> I think a time based threashold makes more sense. What if the timeout was
> nearing and those 100 changes just took little more time causing a timeout? We
> already have a time based threashold in WalSndKeepaliveIfNecessary(). And that
> function is invoked after reading every WAL record in WalSndLoop(). So it does
> not look like it's an expensive function. If it is expensive we might want to
> worry about WalSndLoop as well. Does it make more sense to remove this
> threashold?
>

We have previously tried doing this for every change [1] and it brings
noticeable overhead. In fact, even doing it for every 10 changes
had some overhead, which is why we arrived at this threshold number. I
don't think it can lead to a timeout due to skipping changes, but sure, if
we see any such report we can further fine-tune this setting or
try to make it time-based; for now I feel it would be safe to use
this threshold.

> +
> +    /*
> +     * After continuously processing CHANGES_THRESHOLD changes, we
> +     * try to send a keepalive message if required.
> +     */
> +    if (++changes_count >= CHANGES_THRESHOLD)
> +    {
> +        ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false);
> +        changes_count = 0;
> +    }
> +}
> +
>
> On the other thread, I mentioned that we don't have a TAP test for it.
> I agree with
> Amit's opinion there that it's hard to create a test which will timeout
> everywhere. I think what we need is a way to control the time required for
> decoding a transaction.
>
> A rough idea is to induce a small sleep after decoding every change. The amount
> of sleep * number of changes will help us estimate and control the amount of
> time taken to decode a transaction. Then we create a transaction which will
> take longer than the timeout threashold to decode. But that's a
> significant code. I
> don't think PostgreSQL has a facility to induce a delay at a particular place
> in the code.
>

Yeah, I don't know how to induce such a delay while decoding changes.

One more thing, I think it would be better to expose a new callback
API via the reorder buffer, as suggested previously [2], similar to other
reorder buffer APIs, instead of directly using the reorderbuffer API to
invoke the plugin API.


[1] -
https://www.postgresql.org/message-id/OS3PR01MB6275DFFDAC7A59FA148931529E209%40OS3PR01MB6275.jpnprd01.prod.outlook.com
[2] - https://www.postgresql.org/message-id/CAA4eK1%2BfQjndoBOFUn9Wy0hhm3MLyUWEpcT9O7iuCELktfdBiQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



RE: Logical replication timeout problem

From
"wangw.fnst@fujitsu.com"
Date:
On Fri, Jan 6, 2023 at 15:06 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> Hi Wang,
> Thanks for working on this. One of our customer faced a similar
> situation when running BDR with PostgreSQL.
> 
> I tested your patch and it solves the problem.
> 
> Please find some review comments below

Thanks for your testing and comments.

> +/*
> + * Helper function for ReorderBufferProcessTXN for updating progress.
> + */
> +static inline void
> +ReorderBufferUpdateProgress(ReorderBuffer *rb, ReorderBufferTXN *txn,
> +                            ReorderBufferChange *change)
> +{
> +    LogicalDecodingContext *ctx = rb->private_data;
> +    static int    changes_count = 0;
> 
> It's not easy to know that a variable is static when reading the code which
> uses it. So it's easy to interpret code wrong. I would probably track it
> through logical decoding context itself OR through a global variable like other
> places where we track the last timestamps. But there's more below on this.

I'm not sure if we need to add global variables or member variables for a
cumulative count that is only used here. How would you feel if I added some
comments when declaring this static variable?

> +
> +    if (!ctx->update_progress)
> +        return;
> +
> +    Assert(!ctx->fast_forward);
> +
> +    /* set output state */
> +    ctx->accept_writes = false;
> +    ctx->write_xid = txn->xid;
> +    ctx->write_location = change->lsn;
> +    ctx->end_xact = false;
> 
> This patch reverts many of the changes of the previous commit which tried to
> fix this issue i.e. 55558df2374. end_xact was introduced by the same commit but
> without much explanation of that in the commit message. Its only user,
> WalSndUpdateProgress(), is probably making a wrong assumption as well.
> 
>      * We don't have a mechanism to get the ack for any LSN other than end
>      * xact LSN from the downstream. So, we track lag only for end of
>      * transaction LSN.
> 
> IIUC, WAL sender tracks the LSN of the last WAL record read in sentPtr which is
> sent downstream through a keep alive message. Downstream may
> acknowledge this
> LSN. So we do get ack for any LSN, not just commit LSN.
> 
> So I propose removing end_xact as well.

We didn't track the lag during a transaction because it could make the lag
calculations inaccurate. If we tracked every LSN, we could fail to record
important LSN information because of
WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS (see function WalSndUpdateProgress).
Please see the details in [1] and [2].
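For reference, the throttling I am referring to in WalSndUpdateProgress()
looks roughly like this (paraphrased, not verbatim):

    if (end_xact && TimestampDifferenceExceeds(sendTime, now,
                                               WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS))
    {
        LagTrackerWrite(lsn, now);
        sendTime = now;
    }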

Regards,
Wang Wei

[1] -
https://www.postgresql.org/message-id/OS3PR01MB62755D216245199554DDC8DB9EEA9%40OS3PR01MB6275.jpnprd01.prod.outlook.com
[2] -
https://www.postgresql.org/message-id/OS3PR01MB627514AE0B3040D8F55A68B99EEA9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

RE: Logical replication timeout problem

From
"wangw.fnst@fujitsu.com"
Date:
On Mon, Jan 9, 2023 at 13:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>

Thanks for your comments.

> One more thing, I think it would be better to expose a new callback
> API via reorder buffer as suggested previously [2] similar to other
> reorder buffer APIs instead of directly using reorderbuffer API to
> invoke plugin API.

Yes, I agree. I think it would be better to add a new callback API on HEAD.
So, I improved the fix approach:
Introduce a new optional callback to update the progress. This callback
function is invoked for each change at the end of the main loop of the
function ReorderBufferProcessTXN(). In this way, I think similar timeout
problems could be avoided.

BTW, I did a performance test for this patch. When running the SQL that
reproduces the problem (refreshing the materialized view in synchronous
logical replication mode), the running time of the new function
pgoutput_update_progress is less than 0.1% of the total time. I think this
result looks OK.

Attach the new patch.

Regards,
Wang Wei

Attachments

Re: Logical replication timeout problem

From
Ashutosh Bapat
Date:
On Mon, Jan 9, 2023 at 4:08 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Fri, Jan 6, 2023 at 15:06 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> I'm not sure if we need to add global variables or member variables for a
> cumulative count that is only used here. How would you feel if I add some
> comments when declaring this static variable?

I see WalSndUpdateProgress::sendTime is static already. So this seems
fine. A comment will certainly help.

>
> > +
> > +    if (!ctx->update_progress)
> > +        return;
> > +
> > +    Assert(!ctx->fast_forward);
> > +
> > +    /* set output state */
> > +    ctx->accept_writes = false;
> > +    ctx->write_xid = txn->xid;
> > +    ctx->write_location = change->lsn;
> > +    ctx->end_xact = false;
> >
> > This patch reverts many of the changes of the previous commit which tried to
> > fix this issue i.e. 55558df2374. end_xact was introduced by the same commit but
> > without much explanation of that in the commit message. Its only user,
> > WalSndUpdateProgress(), is probably making a wrong assumption as well.
> >
> >      * We don't have a mechanism to get the ack for any LSN other than end
> >      * xact LSN from the downstream. So, we track lag only for end of
> >      * transaction LSN.
> >
> > IIUC, WAL sender tracks the LSN of the last WAL record read in sentPtr which is
> > sent downstream through a keep alive message. Downstream may
> > acknowledge this
> > LSN. So we do get ack for any LSN, not just commit LSN.
> >
> > So I propose removing end_xact as well.
>
> We didn't track the lag during a transaction because it could make the
> calculations of lag functionality inaccurate. If we track every lsn, it could
> fail to record important lsn information because of
> WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS (see function WalSndUpdateProgress).
> Please see details in [1] and [2].

LagTrackerRead() interpolates to reduce the inaccuracy. I don't
understand why we need to track the end LSN only. But I don't think
that affects this fix. So I am fine if we want to leave end_xact
there.

-- 
Best Wishes,
Ashutosh Bapat



Re: Logical replication timeout problem

From
Ashutosh Bapat
Date:
On Wed, Jan 11, 2023 at 4:11 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Mon, Jan 9, 2023 at 13:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> Thanks for your comments.
>
> > One more thing, I think it would be better to expose a new callback
> > API via reorder buffer as suggested previously [2] similar to other
> > reorder buffer APIs instead of directly using reorderbuffer API to
> > invoke plugin API.
>
> Yes, I agree. I think it would be better to add a new callback API on the HEAD.
> So, I improved the fix approach:
> Introduce a new optional callback to update the process. This callback function
> is invoked at the end inside the main loop of the function
> ReorderBufferProcessTXN() for each change. In this way, I think it seems that
> similar timeout problems could be avoided.

I am a bit worried about the indirections that the wrappers and hooks
create. Output plugins call OutputPluginUpdateProgress() in callbacks,
but I don't see why ReorderBufferProcessTXN() needs a callback to
call OutputPluginUpdateProgress. I don't think output plugins are
going to do anything special with that callback other than just call
OutputPluginUpdateProgress. Every output plugin will need to implement
it, and if they do not, they will face the timeout problem. That would
be unnecessary. Instead, ReorderBufferUpdateProgress() in your first
patch was more direct and readable. That way the fix works for any
output plugin. In fact, I am wondering whether we could have a call in
ReorderBufferProcessTXN() at the end of the transaction
(commit/prepare/commit prepared/abort prepared) instead of the
corresponding output plugin callbacks calling
OutputPluginUpdateProgress().


-- 
Best Wishes,
Ashutosh Bapat



Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Mon, Jan 16, 2023 at 10:06 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Wed, Jan 11, 2023 at 4:11 PM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > On Mon, Jan 9, 2023 at 13:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > Thanks for your comments.
> >
> > > One more thing, I think it would be better to expose a new callback
> > > API via reorder buffer as suggested previously [2] similar to other
> > > reorder buffer APIs instead of directly using reorderbuffer API to
> > > invoke plugin API.
> >
> > Yes, I agree. I think it would be better to add a new callback API on the HEAD.
> > So, I improved the fix approach:
> > Introduce a new optional callback to update the process. This callback function
> > is invoked at the end inside the main loop of the function
> > ReorderBufferProcessTXN() for each change. In this way, I think it seems that
> > similar timeout problems could be avoided.
>
> I am a bit worried about the indirections that the wrappers and hooks
> create. Output plugins call OutputPluginUpdateProgress() in callbacks
> but I don't see why  ReorderBufferProcessTXN() needs a callback to
> call OutputPluginUpdateProgress.
>

Yeah, I think we can do it as in the previous approach, but
we need an additional wrapper (update_progress_cb_wrapper()), as the
current patch has, so that we can add error context information. This
is similar to why we have a wrapper for all other callbacks, like
change_cb_wrapper.

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

From
Ashutosh Bapat
Date:
On Tue, Jan 17, 2023 at 3:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> >
> > I am a bit worried about the indirections that the wrappers and hooks
> > create. Output plugins call OutputPluginUpdateProgress() in callbacks
> > but I don't see why  ReorderBufferProcessTXN() needs a callback to
> > call OutputPluginUpdateProgress.
> >
>
> Yeah, I think we can do it as we are doing the previous approach but
> we need an additional wrapper (update_progress_cb_wrapper()) as the
> current patch has so that we can add error context information. This
> is similar to why we have a wrapper for all other callbacks like
> change_cb_wrapper.
>

Ultimately OutputPluginUpdateProgress() will be called - which in turn
will call ctx->update_progress. I don't see wrappers around
OutputPluginWrite or OutputPluginPrepareWrite. But I see that those
two are always called from an output plugin, so indirectly they are
called through a wrapper. I also see that update_progress_cb_wrapper()
is similar, as far as the wrapper is concerned, to
ReorderBufferUpdateProgress() in the earlier patch.
ReorderBufferUpdateProgress() looks more readable than the wrapper.

If we want to keep the wrapper, at least we should use a different
variable name. update_progress is also present in LogicalDecodingContext
and will be indirectly called from ReorderBuffer::update_progress.
Somebody might think that there's some recursion involved there.
That's mighty confusing.

-- 
Best Wishes,
Ashutosh Bapat



Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Tue, Jan 17, 2023 at 6:41 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Tue, Jan 17, 2023 at 3:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > >
> > > I am a bit worried about the indirections that the wrappers and hooks
> > > create. Output plugins call OutputPluginUpdateProgress() in callbacks
> > > but I don't see why  ReorderBufferProcessTXN() needs a callback to
> > > call OutputPluginUpdateProgress.
> > >
> >
> > Yeah, I think we can do it as we are doing the previous approach but
> > we need an additional wrapper (update_progress_cb_wrapper()) as the
> > current patch has so that we can add error context information. This
> > is similar to why we have a wrapper for all other callbacks like
> > change_cb_wrapper.
> >
>
> Ultimately OutputPluginUpdateProgress() will be called - which in turn
> will call ctx->update_progress.
>

No, update_progress_cb_wrapper() should directly call
ctx->update_progress(). The key reason to have an
update_progress_cb_wrapper() is that it allows us to add error context
information (see the usage of output_plugin_error_callback).
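Roughly, the wrapper would follow the same pattern as the other cb_wrappers
(a sketch only, to show where the error context fits; see change_cb_wrapper
for the real pattern):

static void
update_progress_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
                           ReorderBufferChange *change)
{
    LogicalDecodingContext *ctx = cache->private_data;
    LogicalErrorCallbackState state;
    ErrorContextCallback errcallback;

    /* Push callback + info on the error context stack. */
    state.ctx = ctx;
    state.callback_name = "update_progress";
    state.report_location = change->lsn;
    errcallback.callback = output_plugin_error_callback;
    errcallback.arg = (void *) &state;
    errcallback.previous = error_context_stack;
    error_context_stack = &errcallback;

    /* set output state and invoke the plugin's progress update */
    ctx->write_xid = txn->xid;
    ctx->write_location = change->lsn;
    ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false);

    /* Pop the error context stack. */
    error_context_stack = errcallback.previous;
}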

-- 
With Regards,
Amit Kapila.



RE: Logical replication timeout problem

From
"wangw.fnst@fujitsu.com"
Date:
On Wed, Jan 18, 2023 at 13:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Jan 17, 2023 at 6:41 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > On Tue, Jan 17, 2023 at 3:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > >
> > > > I am a bit worried about the indirections that the wrappers and hooks
> > > > create. Output plugins call OutputPluginUpdateProgress() in callbacks
> > > > but I don't see why  ReorderBufferProcessTXN() needs a callback to
> > > > call OutputPluginUpdateProgress.
> > > >
> > >
> > > Yeah, I think we can do it as we are doing the previous approach but
> > > we need an additional wrapper (update_progress_cb_wrapper()) as the
> > > current patch has so that we can add error context information. This
> > > is similar to why we have a wrapper for all other callbacks like
> > > change_cb_wrapper.
> > >
> >
> > Ultimately OutputPluginUpdateProgress() will be called - which in turn
> > will call ctx->update_progress.
> >
> 
> No, update_progress_cb_wrapper() should directly call
> ctx->update_progress(). The key reason to have a
> update_progress_cb_wrapper() is that it allows us to add error context
> information (see the usage of output_plugin_error_callback).

I think it makes sense. This also avoids the need for every output plugin to
implement the callback. So I tried to improve the patch based on this approach.

And I tried to add some comments for this new callback to distinguish it from
ctx->update_progress.

Attach the new patch.

Regards,
Wang Wei

Attachments

Re: Logical replication timeout problem

From
Ashutosh Bapat
Date:
On Wed, Jan 18, 2023 at 1:49 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Wed, Jan 18, 2023 at 13:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Jan 17, 2023 at 6:41 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> > >
> > > On Tue, Jan 17, 2023 at 3:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > >
> > > > > I am a bit worried about the indirections that the wrappers and hooks
> > > > > create. Output plugins call OutputPluginUpdateProgress() in callbacks
> > > > > but I don't see why  ReorderBufferProcessTXN() needs a callback to
> > > > > call OutputPluginUpdateProgress.
> > > > >
> > > >
> > > > Yeah, I think we can do it as we are doing the previous approach but
> > > > we need an additional wrapper (update_progress_cb_wrapper()) as the
> > > > current patch has so that we can add error context information. This
> > > > is similar to why we have a wrapper for all other callbacks like
> > > > change_cb_wrapper.
> > > >
> > >
> > > Ultimately OutputPluginUpdateProgress() will be called - which in turn
> > > will call ctx->update_progress.
> > >
> >
> > No, update_progress_cb_wrapper() should directly call
> > ctx->update_progress(). The key reason to have a
> > update_progress_cb_wrapper() is that it allows us to add error context
> > information (see the usage of output_plugin_error_callback).
>
> I think it makes sense. This also avoids the need for every output plugin to
> implement the callback. So I tried to improve the patch based on this approach.
>
> And I tried to add some comments for this new callback to distinguish it from
> ctx->update_progress.

Comments don't help when using cscope or some such code browsing tool.
Better to use a different variable name.

-- 
Best Wishes,
Ashutosh Bapat



Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Wed, Jan 18, 2023 at 5:37 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Wed, Jan 18, 2023 at 1:49 PM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > On Wed, Jan 18, 2023 at 13:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > On Tue, Jan 17, 2023 at 6:41 PM Ashutosh Bapat
> > > <ashutosh.bapat.oss@gmail.com> wrote:
> > > >
> > > > On Tue, Jan 17, 2023 at 3:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > > >
> > > > > > I am a bit worried about the indirections that the wrappers and hooks
> > > > > > create. Output plugins call OutputPluginUpdateProgress() in callbacks
> > > > > > but I don't see why  ReorderBufferProcessTXN() needs a callback to
> > > > > > call OutputPluginUpdateProgress.
> > > > > >
> > > > >
> > > > > Yeah, I think we can do it as we are doing the previous approach but
> > > > > we need an additional wrapper (update_progress_cb_wrapper()) as the
> > > > > current patch has so that we can add error context information. This
> > > > > is similar to why we have a wrapper for all other callbacks like
> > > > > change_cb_wrapper.
> > > > >
> > > >
> > > > Ultimately OutputPluginUpdateProgress() will be called - which in turn
> > > > will call ctx->update_progress.
> > > >
> > >
> > > No, update_progress_cb_wrapper() should directly call
> > > ctx->update_progress(). The key reason to have a
> > > update_progress_cb_wrapper() is that it allows us to add error context
> > > information (see the usage of output_plugin_error_callback).
> >
> > I think it makes sense. This also avoids the need for every output plugin to
> > implement the callback. So I tried to improve the patch based on this approach.
> >
> > And I tried to add some comments for this new callback to distinguish it from
> > ctx->update_progress.
>
> Comments don't help when using cscope or some such code browsing tool.
> Better to use a different variable name.
>

+ /*
+ * Callback to be called when updating progress during sending data of a
+ * transaction (and its subtransactions) to the output plugin.
+ */
+ ReorderBufferUpdateProgressCB update_progress;

Are you suggesting changing the name of the above variable? If so, how
about apply_progress, progress, or updateprogress? If you don't like
any of these then feel free to suggest something else. If we change
the variable name then we need to update
ReorderBufferUpdateProgressCB accordingly as well.

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

From
Ashutosh Bapat
Date:
On Wed, Jan 18, 2023 at 6:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> + */
> + ReorderBufferUpdateProgressCB update_progress;
>
> Are you suggesting changing the name of the above variable? If so, how
> about apply_progress, progress, or updateprogress? If you don't like
> any of these then feel free to suggest something else. If we change
> the variable name then accordingly, we need to update
> ReorderBufferUpdateProgressCB as well.
>

I would have liked to have all the callback names renamed with the prefix
"rbcb_xxx" so that they have much less chance of conflicting with
similar names in the code base. But it's probably too late to do that :).

How about update_txn_progress, since the CB is supposed to be used only
within a transaction? Or update_progress_txn?
update_progress_cb_wrapper needs a change of name as well.

-- 
Best Wishes,
Ashutosh Bapat



Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Thu, Jan 19, 2023 at 4:13 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Wed, Jan 18, 2023 at 6:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > + */
> > + ReorderBufferUpdateProgressCB update_progress;
> >
> > Are you suggesting changing the name of the above variable? If so, how
> > about apply_progress, progress, or updateprogress? If you don't like
> > any of these then feel free to suggest something else. If we change
> > the variable name then accordingly, we need to update
> > ReorderBufferUpdateProgressCB as well.
> >
>
> I would liked to have all the callback names renamed with prefix
> "rbcb_xxx" so that they have very less chances of conflicting with
> similar names in the code base. But it's probably late to do that :).
>
> How are update_txn_progress since the CB is supposed to be used only
> within a transaction? or update_progress_txn?
>

Personally, I would prefer 'apply_progress' as it would be similar to
a few other callbacks like apply_change and apply_truncate; or, as
proposed by the patch, update_progress, again because it is similar to
existing callbacks like commit_prepared. If you and others don't like
any of those then we can go for 'update_progress_txn' as well. Does
anybody else have an opinion on this?

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

From
Peter Smith
Date:
Here are some review comments for patch v3-0001.

======
Commit message

1.
The problem is when there is a DDL in a transaction that generates lots of
temporary data due to rewrite rules, these temporary data will not be processed
by the pgoutput - plugin. Therefore, the previous fix (f95d53e) for DML had no
impact on this case.

~

1a.
IMO this comment needs to give a bit of background about the original
problem here, rather than just starting with "The problem is" which is
describing the flaws of the previous fix.

~

1b.
"pgoutput - plugin" -> "pgoutput plugin" ??

~~~

2.

To fix this, we introduced a new ReorderBuffer callback -
'ReorderBufferUpdateProgressCB'. This callback is called to try to update the
process after each change has been processed during sending data of a
transaction (and its subtransactions) to the output plugin.

IIUC it's not really "after each change" - shouldn't this comment
mention something about the CHANGES_THRESHOLD 100?

======
src/backend/replication/logical/logical.c

3. forward declaration

+/* update progress callback */
+static void update_progress_cb_wrapper(ReorderBuffer *cache,
+    ReorderBufferTXN *txn,
+    ReorderBufferChange *change);

I felt this function wrapper name was a bit misleading... AFAIK every
other wrapper really does just wrap its respective function. But
this one seems a bit different because it calls the wrapped function
ONLY if some threshold is exceeded. IMO maybe this function could have
some name that conveys this better:

e.g. update_progress_cb_wrapper_with_threshold

~~~

4. update_progress_cb_wrapper

+/*
+ * Update progress callback
+ *
+ * Try to update progress and send a keepalive message if too many changes were
+ * processed when processing txn.
+ *
+ * For a large transaction, if we don't send any change to the downstream for a
+ * long time (exceeds the wal_receiver_timeout of standby) then it can timeout.
+ * This can happen when all or most of the changes are either not published or
+ * got filtered out.
+ */

SUGGESTION (instead of the "Try to update" sentence)
Send a keepalive message whenever more than <CHANGES_THRESHOLD>
changes are encountered while processing a transaction.

~~~

5.

+static void
+update_progress_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+    ReorderBufferChange *change)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ static int changes_count = 0; /* Static variable used to accumulate
+ * the number of changes while
+ * processing txn. */
+

IMO this may be more readable if the static 'changes_count' local var
was declared first and separated from the other vars by a blank line.

~~~

6.

+ /*
+ * We don't want to try sending a keepalive message after processing each
+ * change as that can have overhead. Tests revealed that there is no
+ * noticeable overhead in doing it after continuously processing 100 or so
+ * changes.
+ */
+#define CHANGES_THRESHOLD 100

6a.
I think it might be better to define this right at the top of the
function adjacent to the 'changes_count' variable (e.g. a bit like the
original HEAD code looked)

~

6b.
SUGGESTION (for the comment)
Sending keepalive messages after every change has some overhead, but
testing showed there is no noticeable overhead if keepalive is only
sent after every ~100 changes.

~~~

7.

+
+ /*
+ * After continuously processing CHANGES_THRESHOLD changes, we
+ * try to send a keepalive message if required.
+ */
+ if (++changes_count >= CHANGES_THRESHOLD)
+ {
+ ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false);
+ changes_count = 0;
+ }
+

7a.
SUGGESTION (for comment)
Send a keepalive message after every CHANGES_THRESHOLD changes.

~

7b.
Would it be neater to just call OutputPluginUpdateProgress here instead?

e.g.
BEFORE
ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false);
AFTER
OutputPluginUpdateProgress(ctx, false);

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Fri, Jan 20, 2023 at 7:40 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Here are some review comments for patch v3-0001.
>
> ======
> src/backend/replication/logical/logical.c
>
> 3. forward declaration
>
> +/* update progress callback */
> +static void update_progress_cb_wrapper(ReorderBuffer *cache,
> +    ReorderBufferTXN *txn,
> +    ReorderBufferChange *change);
>
> I felt this function wrapper name was a bit misleading... AFAIK every
> other wrapper really does just wrap their respective functions. But
> this one seems a bit different because it calls the wrapped function
> ONLY if some threshold is exceeded. IMO maybe this function could have
> some name that conveys this better:
>
> e.g. update_progress_cb_wrapper_with_threshold
>

I am wondering whether it would be better to move the threshold logic
to the caller. Previously this logic was inside the function because
it was being invoked from multiple places, but now that won't be the
case. That way, your concern about the name would also be addressed.
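In other words, something along these lines in ReorderBufferProcessTXN()'s
per-change loop (a sketch only; the counter and threshold are placeholders):

    if (++changes_count >= CHANGES_THRESHOLD)
    {
        rb->update_progress(rb, txn, change);
        changes_count = 0;
    }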

>
> ~
>
> 7b.
> Would it be neater to just call OutputPluginUpdateProgress here instead?
>
> e.g.
> BEFORE
> ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false);
> AFTER
> OutputPluginUpdateProgress(ctx, false);
>

We already check whether ctx->update_progress is defined or not, which
is the only extra job done by OutputPluginUpdateProgress, but probably
we can consolidate the checks and directly invoke
OutputPluginUpdateProgress.

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

From
Peter Smith
Date:
On Fri, Jan 20, 2023 at 3:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 20, 2023 at 7:40 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Here are some review comments for patch v3-0001.
> >
> > ======
> > src/backend/replication/logical/logical.c
> >
> > 3. forward declaration
> >
> > +/* update progress callback */
> > +static void update_progress_cb_wrapper(ReorderBuffer *cache,
> > +    ReorderBufferTXN *txn,
> > +    ReorderBufferChange *change);
> >
> > I felt this function wrapper name was a bit misleading... AFAIK every
> > other wrapper really does just wrap their respective functions. But
> > this one seems a bit different because it calls the wrapped function
> > ONLY if some threshold is exceeded. IMO maybe this function could have
> > some name that conveys this better:
> >
> > e.g. update_progress_cb_wrapper_with_threshold
> >
>
> I am wondering whether it would be better to move the threshold logic
> to the caller. Previously this logic was inside the function because
> it was being invoked from multiple places but now that won't be the
> case. Also, then your concern about the name would also be addressed.
>
> >
> > ~
> >
> > 7b.
> > Would it be neater to just call OutputPluginUpdateProgress here instead?
> >
> > e.g.
> > BEFORE
> > ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false);
> > AFTER
> > OutputPluginUpdateProgress(ctx, false);
> >
>
> We already check whether ctx->update_progress is defined or not which
> is the only extra job done by OutputPluginUpdateProgress but probably
> we can consolidate the checks and directly invoke
> OutputPluginUpdateProgress.
>

Yes, I saw that, but I thought it was better to keep the early exit
from update_progress_cb_wrapper, so incurring just one additional
boolean check for every 100 changes is nothing to worry about.

------
Kind Regards,
Peter Smith.
Fujitsu Australia.



RE: Logical replication timeout problem

From
"wangw.fnst@fujitsu.com"
Date:
On Thu, Jan 19, 2023 at 19:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jan 19, 2023 at 4:13 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > On Wed, Jan 18, 2023 at 6:00 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > > + */
> > > + ReorderBufferUpdateProgressCB update_progress;
> > >
> > > Are you suggesting changing the name of the above variable? If so, how
> > > about apply_progress, progress, or updateprogress? If you don't like
> > > any of these then feel free to suggest something else. If we change
> > > the variable name then accordingly, we need to update
> > > ReorderBufferUpdateProgressCB as well.
> > >
> >
> > I would liked to have all the callback names renamed with prefix
> > "rbcb_xxx" so that they have very less chances of conflicting with
> > similar names in the code base. But it's probably late to do that :).
> >
> > How are update_txn_progress since the CB is supposed to be used only
> > within a transaction? or update_progress_txn?
> >
> 
> Personally, I would prefer 'apply_progress' as it would be similar to
> a few other callbacks like apply_change, apply_truncate, or as is
> proposed by patch update_progress again because it is similar to
> existing callbacks like commit_prepared. If you and others don't like
> any of those then we can go for 'update_progress_txn' as well. Anybody
> else has an opinion on this?

I think 'update_progress_txn' might be better, because this name makes it
easier to know that this callback is used to update progress while
processing a txn. So, I renamed it to 'update_progress_txn'.

I have addressed all the comments and here is the new version patch.

Regards,
Wang Wei

Attachments

RE: Logical replication timeout problem

From
"wangw.fnst@fujitsu.com"
Date:
On Fri, Jan 20, 2023 at 12:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jan 20, 2023 at 7:40 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Here are some review comments for patch v3-0001.
> >
> > ======
> > src/backend/replication/logical/logical.c
> >
> > 3. forward declaration
> >
> > +/* update progress callback */
> > +static void update_progress_cb_wrapper(ReorderBuffer *cache,
> > +    ReorderBufferTXN *txn,
> > +    ReorderBufferChange *change);
> >
> > I felt this function wrapper name was a bit misleading... AFAIK every
> > other wrapper really does just wrap their respective functions. But
> > this one seems a bit different because it calls the wrapped function
> > ONLY if some threshold is exceeded. IMO maybe this function could have
> > some name that conveys this better:
> >
> > e.g. update_progress_cb_wrapper_with_threshold
> >
> 
> I am wondering whether it would be better to move the threshold logic
> to the caller. Previously this logic was inside the function because
> it was being invoked from multiple places but now that won't be the
> case. Also, then your concern about the name would also be addressed.

Agree. Moved the threshold logic to the function ReorderBufferProcessTXN.

> >
> > ~
> >
> > 7b.
> > Would it be neater to just call OutputPluginUpdateProgress here instead?
> >
> > e.g.
> > BEFORE
> > ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false);
> > AFTER
> > OutputPluginUpdateProgress(ctx, false);
> >
> 
> We already check whether ctx->update_progress is defined or not which
> is the only extra job done by OutputPluginUpdateProgress but probably
> we can consolidate the checks and directly invoke
> OutputPluginUpdateProgress.

Changed. The new callback now invokes the function OutputPluginUpdateProgress
directly.
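That is, the new wrapper now ends up doing roughly the following (a sketch;
the actual patch is authoritative):

    /* set output state */
    ctx->accept_writes = false;
    ctx->write_xid = txn->xid;
    ctx->write_location = change->lsn;
    ctx->end_xact = false;

    OutputPluginUpdateProgress(ctx, false);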

Regards,
Wang Wei

RE: Logical replication timeout problem

From
"wangw.fnst@fujitsu.com"
Date:
On Fri, Jan 20, 2023 at 10:10 AM Peter Smith <smithpb2250@gmail.com> wrote:
> Here are some review comments for patch v3-0001.

Thanks for your comments.

> ======
> Commit message
> 
> 1.
> The problem is when there is a DDL in a transaction that generates lots of
> temporary data due to rewrite rules, these temporary data will not be
> processed
> by the pgoutput - plugin. Therefore, the previous fix (f95d53e) for DML had no
> impact on this case.
> 
> ~
> 
> 1a.
> IMO this comment needs to give a bit of background about the original
> problem here, rather than just starting with "The problem is" which is
> describing the flaws of the previous fix.

Added some related background to the commit message.

> ~
> 
> 1b.
> "pgoutput - plugin" -> "pgoutput plugin" ??

Changed.

> ~~~
> 
> 2.
> 
> To fix this, we introduced a new ReorderBuffer callback -
> 'ReorderBufferUpdateProgressCB'. This callback is called to try to update the
> process after each change has been processed during sending data of a
> transaction (and its subtransactions) to the output plugin.
> 
> IIUC it's not really "after each change" - shouldn't this comment
> mention something about the CHANGES_THRESHOLD 100?

Changed.

> ~~~
> 
> 4. update_progress_cb_wrapper
> 
> +/*
> + * Update progress callback
> + *
> + * Try to update progress and send a keepalive message if too many changes
> were
> + * processed when processing txn.
> + *
> + * For a large transaction, if we don't send any change to the downstream for a
> + * long time (exceeds the wal_receiver_timeout of standby) then it can
> timeout.
> + * This can happen when all or most of the changes are either not published or
> + * got filtered out.
> + */
> 
> SUGGESTION (instead of the "Try to update" sentence)
> Send a keepalive message whenever more than <CHANGES_THRESHOLD>
> changes are encountered while processing a transaction.

Since it's possible that keep-alive messages won't be sent even if the
threshold is reached (see function WalSndKeepaliveIfNecessary), I thought it
might be better to use "try to".
I also rewrote the comments here because the threshold logic was moved to the
function ReorderBufferProcessTXN.

> ~~~
> 
> 5.
> 
> +static void
> +update_progress_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
> +    ReorderBufferChange *change)
> +{
> + LogicalDecodingContext *ctx = cache->private_data;
> + LogicalErrorCallbackState state;
> + ErrorContextCallback errcallback;
> + static int changes_count = 0; /* Static variable used to accumulate
> + * the number of changes while
> + * processing txn. */
> +
> 
> IMO this may be more readable if the static 'changes_count' local var
> was declared first and separated from the other vars by a blank line.

Changed.

> ~~~
> 
> 6.
> 
> + /*
> + * We don't want to try sending a keepalive message after processing each
> + * change as that can have overhead. Tests revealed that there is no
> + * noticeable overhead in doing it after continuously processing 100 or so
> + * changes.
> + */
> +#define CHANGES_THRESHOLD 100
> 
> 6a.
> I think it might be better to define this right at the top of the
> function adjacent to the 'changes_count' variable (e.g. a bit like the
> original HEAD code looked)

Changed.

> ~
> 
> 6b.
> SUGGESTION (for the comment)
> Sending keepalive messages after every change has some overhead, but
> testing showed there is no noticeable overhead if keepalive is only
> sent after every ~100 changes.

Changed.

> ~~~
> 
> 7.
> 
> +
> + /*
> + * After continuously processing CHANGES_THRESHOLD changes, we
> + * try to send a keepalive message if required.
> + */
> + if (++changes_count >= CHANGES_THRESHOLD)
> + {
> + ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false);
> + changes_count = 0;
> + }
> +
> 
> 7a.
> SUGGESTION (for comment)
> Send a keepalive message after every CHANGES_THRESHOLD changes.

Changed.

Regards,
Wang Wei

Re: Logical replication timeout problem

From
Peter Smith
Date:
Here are my review comments for patch v4-0001

======
General

1.

It makes no real difference, but I was wondering about:
"update txn progress" versus "update progress txn"

I thought that the first way sounds more natural. YMMV.

If you change this then there is impact for the typedef, function
names, comments, member names:

ReorderBufferUpdateTxnProgressCB -->  ReorderBufferUpdateProgressTxnCB

“/* update progress txn callback */” --> “/* update txn progress callback */”

update_progress_txn_cb_wrapper -->  update_txn_progress_cb_wrapper

updated_progress_txn --> update_txn_progress

======
Commit message

2.

The problem is when there is a DDL in a transaction that generates lots of
temporary data due to rewrite rules, these temporary data will not be processed
by the pgoutput plugin. The previous commit (f95d53e) only fixed timeouts
caused by filtering out changes in pgoutput. Therefore, the previous fix for
DML had no impact on this case.

~

IMO this still needs some rewording to say up-front what the actual
problem is -- i.e. an avoidable timeout occurring.

SUGGESTION (or something like this...)

When there is a DDL in a transaction that generates lots of temporary
data due to rewrite rules, this temporary data will not be processed
by the pgoutput plugin. This means it is possible for a timeout to
occur if a sufficiently long time elapses since the last pgoutput
message. A previous commit (f95d53e) fixed a similar scenario in this
area, but that only fixed timeouts for DML going through pgoutput, so
it did not address this DDL timeout case.

======
src/backend/replication/logical/logical.c

3. update_progress_txn_cb_wrapper

+/*
+ * Update progress callback while processing a transaction.
+ *
+ * Try to update progress and send a keepalive message during sending data of a
+ * transaction (and its subtransactions) to the output plugin.
+ *
+ * For a large transaction, if we don't send any change to the downstream for a
+ * long time (exceeds the wal_receiver_timeout of standby) then it can timeout.
+ * This can happen when all or most of the changes are either not published or
+ * got filtered out.
+ */
+static void
+update_progress_txn_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+    ReorderBufferChange *change)

Simplify the "Try to..." paragraph. And the other part should also mention
DDL.

SUGGESTION

Try to send a keepalive message during transaction processing.

This is done because if we don't send any change to the downstream for
a long time (exceeds the wal_receiver_timeout of standby), then it can
timeout. This can happen for large DDL, or for large transactions when
all or most of the changes are either not published or got filtered
out.

======
.../replication/logical/reorderbuffer.c

4. ReorderBufferProcessTXN

@@ -2105,6 +2105,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,

  PG_TRY();
  {
+ /*
+ * Static variable used to accumulate the number of changes while
+ * processing txn.
+ */
+ static int changes_count = 0;
+
+ /*
+ * Sending keepalive messages after every change has some overhead, but
+ * testing showed there is no noticeable overhead if keepalive is only
+ * sent after every ~100 changes.
+ */
+#define CHANGES_THRESHOLD 100
+

IMO these can be relocated to be declared/defined inside the "while"
loop -- i.e. closer to where they are being used.

~~~

5.

+ if (++changes_count >= CHANGES_THRESHOLD)
+ {
+ rb->update_progress_txn(rb, txn, change);
+ changes_count = 0;
+ }

When there is no update_progress function this code still incurs
some small additional overhead for incrementing and testing the
THRESHOLD every time, and also needlessly calls the wrapper every
100 changes. This overhead could be avoided with a simpler up-front check
as shown below. OTOH, maybe the overhead is insignificant enough
that just leaving the current code is neater?

LogicalDecodingContext *ctx = rb->private_data;
...
if (ctx->update_progress_txn && (++changes_count >= CHANGES_THRESHOLD))
{
rb->update_progress_txn(rb, txn, change);
changes_count = 0;
}

------
Kind Reagrds,
Peter Smith.
Fujitsu Australia



Re: Logical replication timeout problem

From
Amit Kapila
Date:
On Mon, Jan 23, 2023 at 6:21 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> 1.
>
> It makes no real difference, but I was wondering about:
> "update txn progress" versus "update progress txn"
>

Yeah, I think we can go either way but I still prefer "update progress
txn" as that is closer to the LogicalOutputPluginWriterUpdateProgress
callback name.

>
> 5.
>
> + if (++changes_count >= CHANGES_THRESHOLD)
> + {
> + rb->update_progress_txn(rb, txn, change);
> + changes_count = 0;
> + }
>
> When there is no update_progress function this code is still incurring
> some small additional overhead for incrementing and testing the
> THRESHOLD every time, and also needlessly calling to the wrapper every
> 100x. This overhead could be avoided with a simpler up-front check
> like shown below. OTOH, maybe the overhead is insignificant enough
> that just leaving the curent code is neater?
>

As far as built-in logical replication is concerned, it will be
defined and I don't know if the overhead will be significant enough in
this case. Also, one can say that for the cases where it is defined, we are
adding this check multiple times (it is already checked inside
OutputPluginUpdateProgress). So, I would prefer the neater code here.

-- 
With Regards,
Amit Kapila.



RE: Logical replication timeout problem

От
"houzj.fnst@fujitsu.com"
Дата:
On Monday, January 23, 2023 8:51 AM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> Here are my review comments for patch v4-0001
> ======
> Commit message
> 
> 2.
> 
> The problem is when there is a DDL in a transaction that generates lots of
> temporary data due to rewrite rules, these temporary data will not be processed
> by the pgoutput plugin. The previous commit (f95d53e) only fixed timeouts
> caused by filtering out changes in pgoutput. Therefore, the previous fix for DML
> had no impact on this case.
> 
> ~
> 
> IMO this still needs some rewording to say up-front what the actual problem is -- i.e.
> an avoidable timeout occurring.
> 
> SUGGESTION (or something like this...)
> 
> When there is a DDL in a transaction that generates lots of temporary data due
> to rewrite rules, this temporary data will not be processed by the pgoutput
> plugin. This means it is possible for a timeout to occur if a sufficiently long time
> elapses since the last pgoutput message. A previous commit (f95d53e) fixed a
> similar scenario in this area, but that only fixed timeouts for DML going through
> pgoutput, so it did not address this DDL timeout case.

Thanks, I changed the commit message as suggested.

> ======
> src/backend/replication/logical/logical.c
> 
> 3. update_progress_txn_cb_wrapper
> 
> +/*
> + * Update progress callback while processing a transaction.
> + *
> + * Try to update progress and send a keepalive message during sending
> +data of a
> + * transaction (and its subtransactions) to the output plugin.
> + *
> + * For a large transaction, if we don't send any change to the
> +downstream for a
> + * long time (exceeds the wal_receiver_timeout of standby) then it can timeout.
> + * This can happen when all or most of the changes are either not
> +published or
> + * got filtered out.
> + */
> +static void
> +update_progress_txn_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN
> *txn,
> +    ReorderBufferChange *change)
> 
> Simplify the "Try to..." paragraph. And other part should also mention about DDL.
> 
> SUGGESTION
> 
> Try send a keepalive message during transaction processing.
> 
> This is done because if we don't send any change to the downstream for a long
> time (exceeds the wal_receiver_timeout of standby), then it can timeout. This can
> happen for large DDL, or for large transactions when all or most of the changes
> are either not published or got filtered out.

Changed.

> ======
> .../replication/logical/reorderbuffer.c
> 
> 4. ReorderBufferProcessTXN
> 
> @@ -2105,6 +2105,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn,
> 
>   PG_TRY();
>   {
> + /*
> + * Static variable used to accumulate the number of changes while
> + * processing txn.
> + */
> + static int changes_count = 0;
> +
> + /*
> + * Sending keepalive messages after every change has some overhead, but
> + * testing showed there is no noticeable overhead if keepalive is only
> + * sent after every ~100 changes.
> + */
> +#define CHANGES_THRESHOLD 100
> +
> 
> IMO these can be relocated to be declared/defined inside the "while"
> loop -- i.e. closer to where they are being used.

Moved into the while loop.

Attach the new version patch which addresses the above comments.
Also attach a simple script which uses "refresh matview" to reproduce
this timeout problem, just in case someone wants to try to reproduce it.

Best regards,
Hou zj

Вложения

Re: Logical replication timeout problem

От
Peter Smith
Дата:
Hi Hou-san, Here are my review comments for v5-0001.

======
src/backend/replication/logical/reorderbuffer.c

1.
@@ -2446,6 +2452,23 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
  elog(ERROR, "tuplecid value in changequeue");
  break;
  }
+
+ /*
+ * Sending keepalive messages after every change has some overhead, but
+ * testing showed there is no noticeable overhead if keepalive is only
+ * sent after every ~100 changes.
+ */
+#define CHANGES_THRESHOLD 100
+
+ /*
+ * Try to send a keepalive message after every CHANGES_THRESHOLD
+ * changes.
+ */
+ if (++changes_count >= CHANGES_THRESHOLD)
+ {
+ rb->update_progress_txn(rb, txn, change);
+ changes_count = 0;
+ }

I noticed you put the #define adjacent to the only usage of it,
instead of with the other variable declaration like it was before.
Probably it is better how you have done it, but:

1a.
The comment indentation is incorrect.

~

1b.
Since the #define is adjacent to its only usage, IMO the 2nd comment is
now redundant. So the code can just say:

            /*
             * Sending keepalive messages after every change has some overhead,
             * but testing showed there is no noticeable overhead if keepalive
             * is only sent after every ~100 changes.
             */
#define CHANGES_THRESHOLD 100
            if (++changes_count >= CHANGES_THRESHOLD)
            {
                rb->update_progress_txn(rb, txn, change);
                changes_count = 0;
            }

------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Tues, Jan 24, 2023 at 8:28 AM Peter Smith <smithpb2250@gmail.com> wrote:
> Hi Hou-san, Here are my review comments for v5-0001.

Thanks for your comments.

> ======
> src/backend/replication/logical/reorderbuffer.c
> 
> 1.
> @@ -2446,6 +2452,23 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn,
>   elog(ERROR, "tuplecid value in changequeue");
>   break;
>   }
> +
> + /*
> + * Sending keepalive messages after every change has some overhead, but
> + * testing showed there is no noticeable overhead if keepalive is only
> + * sent after every ~100 changes.
> + */
> +#define CHANGES_THRESHOLD 100
> +
> + /*
> + * Try to send a keepalive message after every CHANGES_THRESHOLD
> + * changes.
> + */
> + if (++changes_count >= CHANGES_THRESHOLD)
> + {
> + rb->update_progress_txn(rb, txn, change);
> + changes_count = 0;
> + }
> 
> I noticed you put the #define adjacent to the only usage of it,
> instead of with the other variable declaration like it was before.
> Probably it is better how you have done it, but:
> 
> 1a.
> The comment indentation is incorrect.
> 
> ~
> 
> 1b.
> Since the #define is adjacent to its only usage IMO now the 2nd
> comment is redundant. So the code can just say
> 
>             /*
>              * Sending keepalive messages after every change has some overhead,
>              * but testing showed there is no noticeable overhead if keepalive
>              * is only sent after every ~100 changes.
>              */
> #define CHANGES_THRESHOLD 100
>             if (++changes_count >= CHANGES_THRESHOLD)
>             {
>                 rb->update_progress_txn(rb, txn, change);
>                 changes_count = 0;
>             }

Changed as suggested.

Attach the new patch.

Regards,
Wang Wei

Вложения

Re: Logical replication timeout problem

От
Peter Smith
Дата:
On Tue, Jan 24, 2023 at 1:45 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Tues, Jan 24, 2023 at 8:28 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > Hi Hou-san, Here are my review comments for v5-0001.
>
> Thanks for your comments.
...
>
> Changed as suggested.
>
> Attach the new patch.

Thanks! Patch v6 LGTM.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Tue, Jan 24, 2023 at 8:15 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> Attach the new patch.
>

I think the patch missed handling the case of non-transactional
messages, which was previously handled. I have tried to address
that in the attached. Is there a reason it shouldn't be handled?
Apart from that, I changed a few comments. If my understanding is
correct, then we need to change the callback update_progress_txn name
as well because now it needs to handle both transactional and
non-transactional changes. How about update_progress_write? We
accordingly need to change the comments for the callback.

Additionally, I think we should have a test case to show we don't time
out because of not processing non-transactional messages. See
pgoutput_message for cases where it doesn't process the message.

-- 
With Regards,
Amit Kapila.

Вложения

RE: Logical replication timeout problem

От
"houzj.fnst@fujitsu.com"
Дата:
On Wednesday, January 25, 2023 7:26 PM Amit Kapila <amit.kapila16@gmail.com>
> 
> On Tue, Jan 24, 2023 at 8:15 AM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > Attach the new patch.
> >
> 
> I think the patch missed to handle the case of non-transactional messages which
> was previously getting handled. I have tried to address that in the attached. Is
> there a reason that shouldn't be handled?

Thanks for updating the patch!

I thought about the non-transactional message case. I think it seems fine if we
don’t handle it for the timeout because such messages are decoded via:

WalSndLoop
-XLogSendLogical
--LogicalDecodingProcessRecord
---logicalmsg_decode
----ReorderBufferQueueMessage
-----rb->message() -- //maybe send the message or do nothing here.

After invoking rb->message(), we will directly return to the main
loop(WalSndLoop) where we will get a chance to call
WalSndKeepaliveIfNecessary() to avoid the timeout.

This is a bit different for transactional changes, because we buffer
them and then send every buffered change one by one (via
ReorderBufferProcessTXN) without going back to the WalSndLoop, so we don't get
a chance to send a keepalive message if necessary, which is more likely to cause the
timeout problem.
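
To make the contrast concrete, here is a small standalone sketch (plain C,
not PostgreSQL code; all names are made up): non-transactional messages come
back to an outer loop that can send a keepalive after every record, while
buffered transactional changes are replayed in one tight loop at commit, so
a periodic check inside that loop is the only chance to send anything:

#include <stdio.h>

#define NCHANGES           100000
#define CHANGES_THRESHOLD  100

static void
keepalive_if_necessary(void)
{
    /* stand-in for WalSndKeepaliveIfNecessary()-style logic */
}

static void
outer_loop_nontransactional(void)
{
    for (int i = 0; i < NCHANGES; i++)
    {
        /* decode and immediately emit (or skip) one message */
        keepalive_if_necessary();   /* reached after every record */
    }
}

static void
replay_transaction_at_commit(void)
{
    int     changes_count = 0;

    for (int i = 0; i < NCHANGES; i++)
    {
        /* replay one queued change; it may be filtered out entirely */

        /* periodic keepalive inside the loop, as the patch does */
        if (++changes_count >= CHANGES_THRESHOLD)
        {
            keepalive_if_necessary();
            changes_count = 0;
        }
    }
}

int
main(void)
{
    outer_loop_nontransactional();
    replay_transaction_at_commit();
    return 0;
}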

I will also test the non-transactional message for timeout in case I missed something.

Best Regards,
Hou zj

Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Fri, Jan 27, 2023 at 5:18 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, January 25, 2023 7:26 PM Amit Kapila <amit.kapila16@gmail.com>
> >
> > On Tue, Jan 24, 2023 at 8:15 AM wangw.fnst@fujitsu.com
> > <wangw.fnst@fujitsu.com> wrote:
> > >
> > > Attach the new patch.
> > >
> >
> > I think the patch missed to handle the case of non-transactional messages which
> > was previously getting handled. I have tried to address that in the attached. Is
> > there a reason that shouldn't be handled?
>
> Thanks for updating the patch!
>
> I thought about the non-transactional message. I think it seems fine if we
> don’t handle it for timeout because such message is decoded via:
>
> WalSndLoop
> -XLogSendLogical
> --LogicalDecodingProcessRecord
> ---logicalmsg_decode
> ----ReorderBufferQueueMessage
> -----rb->message() -- //maybe send the message or do nothing here.
>
> After invoking rb->message(), we will directly return to the main
> loop(WalSndLoop) where we will get a chance to call
> WalSndKeepaliveIfNecessary() to avoid the timeout.
>

Valid point. But this means the previous handling of non-transactional
messages was also redundant.

> This is a bit different from transactional changes, because for transactional changes, we
> will buffer them and then send every buffered change one by one(via
> ReorderBufferProcessTXN) without going back to the WalSndLoop, so we don't get
> a chance to send keepalive message if necessary, which is more likely to cause the
> timeout problem.
>
> I will also test the non-transactional message for timeout in case I missed something.
>

Okay, thanks. Please see if we can test a mix of transactional and
non-transactional messages as well.

--
With Regards,
Amit Kapila.



RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Fri, Jan 27, 2023 at 19:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jan 27, 2023 at 5:18 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wednesday, January 25, 2023 7:26 PM Amit Kapila
> <amit.kapila16@gmail.com>
> > >
> > > On Tue, Jan 24, 2023 at 8:15 AM wangw.fnst@fujitsu.com
> > > <wangw.fnst@fujitsu.com> wrote:
> > > >
> > > > Attach the new patch.
> > > >
> > >
> > > I think the patch missed to handle the case of non-transactional messages
> which
> > > was previously getting handled. I have tried to address that in the attached.
> Is
> > > there a reason that shouldn't be handled?
> >
> > Thanks for updating the patch!
> >
> > I thought about the non-transactional message. I think it seems fine if we
> > don’t handle it for timeout because such message is decoded via:
> >
> > WalSndLoop
> > -XLogSendLogical
> > --LogicalDecodingProcessRecord
> > ---logicalmsg_decode
> > ----ReorderBufferQueueMessage
> > -----rb->message() -- //maybe send the message or do nothing here.
> >
> > After invoking rb->message(), we will directly return to the main
> > loop(WalSndLoop) where we will get a chance to call
> > WalSndKeepaliveIfNecessary() to avoid the timeout.
> >
> 
> Valid point. But this means the previous handling of non-transactional
> messages was also redundant.

Thanks for the analysis, I think it makes sense. So I removed the handling of
non-transactional messages.

> > This is a bit different from transactional changes, because for transactional
> changes, we
> > will buffer them and then send every buffered change one by one(via
> > ReorderBufferProcessTXN) without going back to the WalSndLoop, so we
> don't get
> > a chance to send keepalive message if necessary, which is more likely to cause
> the
> > timeout problem.
> >
> > I will also test the non-transactional message for timeout in case I missed
> something.
> >
> 
> Okay, thanks. Please see if we can test a mix of transactional and
> non-transactional messages as well.

I tested a mixed transaction of transactional and non-transactional messages on
the current HEAD and reproduced the timeout problem. I think this result is OK
because, when decoding a transaction, non-transactional changes are processed
directly and the function WalSndKeepaliveIfNecessary is called, while
transactional changes are cached and processed after decoding. After decoding,
only transactional changes will be processed (in the function
ReorderBufferProcessTXN), so the timeout problem will still be reproduced.

After applying the v8 patch, the test mentioned above didn't reproduce the
timeout problem (the test script 'test_with_nontransactional.sh' is attached).

Attach the new patch.

Regards,
Wang Wei

Вложения

RE: Logical replication timeout problem

От
"shiy.fnst@fujitsu.com"
Дата:
On Sun, Jan 29, 2023 3:41 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
> 
> I tested a mix transaction of transactional and non-transactional messages on
> the current HEAD and reproduced the timeout problem. I think this result is OK.
> Because when decoding a transaction, non-transactional changes are processed
> directly and the function WalSndKeepaliveIfNecessary is called, while
> transactional changes are cached and processed after decoding. After decoding,
> only transactional changes will be processed (in the function
> ReorderBufferProcessTXN), so the timeout problem will still be reproduced.
> 
> After applying the v8 patch, the test mentioned above didn't reproduce the
> timeout problem (Attach this test script 'test_with_nontransactional.sh').
> 
> Attach the new patch.
> 

Thanks for updating the patch. Here is a comment.

In update_progress_txn_cb_wrapper(), it looks like we need to reset
changes_count to 0 after calling OutputPluginUpdateProgress(); otherwise,
once the counter reaches 100, OutputPluginUpdateProgress() will be called
for every subsequent change.
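
In other words, something along these lines (a standalone toy example, not
the patch itself; send_progress_and_keepalive() here just stands in for the
real OutputPluginUpdateProgress() call):

#include <stdio.h>

#define CHANGES_THRESHOLD 100

static int changes_count = 0;

/* placeholder for the real OutputPluginUpdateProgress() call */
static void
send_progress_and_keepalive(void)
{
    printf("progress updated\n");
}

static void
on_change(void)
{
    if (++changes_count < CHANGES_THRESHOLD)
        return;

    send_progress_and_keepalive();
    changes_count = 0;  /* without this reset, the call above fires on every
                         * change once the threshold is first reached */
}

int
main(void)
{
    for (int i = 0; i < 300; i++)
        on_change();    /* prints three times, not 201 times */
    return 0;
}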

Regards,
Shi yu


RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Mon, Jan 30, 2023 11:37 AM Shi, Yu/侍 雨 <shiy.fnst@cn.fujitsu.com> wrote:
> On Sun, Jan 29, 2023 3:41 PM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > I tested a mix transaction of transactional and non-transactional messages on
> > the current HEAD and reproduced the timeout problem. I think this result is
> OK.
> > Because when decoding a transaction, non-transactional changes are
> processed
> > directly and the function WalSndKeepaliveIfNecessary is called, while
> > transactional changes are cached and processed after decoding. After
> decoding,
> > only transactional changes will be processed (in the function
> > ReorderBufferProcessTXN), so the timeout problem will still be reproduced.
> >
> > After applying the v8 patch, the test mentioned above didn't reproduce the
> > timeout problem (Attach this test script 'test_with_nontransactional.sh').
> >
> > Attach the new patch.
> >
> 
> Thanks for updating the patch. Here is a comment.

Thanks for your comment.

> In update_progress_txn_cb_wrapper(), it looks like we need to reset
> changes_count to 0 after calling OutputPluginUpdateProgress(), otherwise
> OutputPluginUpdateProgress() will always be called after 100 changes.

Yes, I think you are right.
Fixed this problem.

Attach the new patch.

Regards,
Wang Wei

Вложения

Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Mon, Jan 30, 2023 at 10:36 AM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Mon, Jan 30, 2023 11:37 AM Shi, Yu/侍 雨 <shiy.fnst@cn.fujitsu.com> wrote:
> > On Sun, Jan 29, 2023 3:41 PM wangw.fnst@fujitsu.com
> > <wangw.fnst@fujitsu.com> wrote:
>
> Yes, I think you are right.
> Fixed this problem.
>

+ /*
+ * Trying to send keepalive message after every change has some
+ * overhead, but testing showed there is no noticeable overhead if
+ * we do it after every ~100 changes.
+ */
+#define CHANGES_THRESHOLD 100
+
+ if (++changes_count < CHANGES_THRESHOLD)
+ return;
...
+ changes_count = 0;

I think it is better to have this threshold-related code in that
caller as we have in the previous version. Also, let's modify the
comment as follows:"
It is possible that the data is not sent to downstream for a long time
either because the output plugin filtered it or there is a DDL that
generates a lot of data that is not processed by the plugin. So, in
such cases, the downstream can timeout. To avoid that we try to send a
keepalive message if required. Trying to send a keepalive message
after every change has some overhead, but testing showed there is no
noticeable overhead if we do it after every ~100 changes."

--
With Regards,
Amit Kapila.



RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Mon, Jan 30, 2023 at 14:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jan 30, 2023 at 10:36 AM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > On Mon, Jan 30, 2023 11:37 AM Shi, Yu/侍 雨 <shiy.fnst@cn.fujitsu.com>
> wrote:
> > > On Sun, Jan 29, 2023 3:41 PM wangw.fnst@fujitsu.com
> > > <wangw.fnst@fujitsu.com> wrote:
> >
> > Yes, I think you are right.
> > Fixed this problem.
> >
> 
> + /*
> + * Trying to send keepalive message after every change has some
> + * overhead, but testing showed there is no noticeable overhead if
> + * we do it after every ~100 changes.
> + */
> +#define CHANGES_THRESHOLD 100
> +
> + if (++changes_count < CHANGES_THRESHOLD)
> + return;
> ...
> + changes_count = 0;
> 
> I think it is better to have this threshold-related code in that
> caller as we have in the previous version. Also, let's modify the
> comment as follows:"
> It is possible that the data is not sent to downstream for a long time
> either because the output plugin filtered it or there is a DDL that
> generates a lot of data that is not processed by the plugin. So, in
> such cases, the downstream can timeout. To avoid that we try to send a
> keepalive message if required. Trying to send a keepalive message
> after every change has some overhead, but testing showed there is no
> noticeable overhead if we do it after every ~100 changes."

Changed as suggested.

I also removed the comment atop the function update_progress_txn_cb_wrapper to
be consistent with the nearby *_cb_wrapper functions.

Attach the new patch.

Regards,
Wang Wei

Вложения

RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Mon, Jan 30, 2023 at 17:50 PM I wrote:
> Attach the new patch.

When invoking the function ReorderBufferProcessTXN, the threshold-related
counter "changes_count" may carry over a leftover value from the previous
transaction's processing. To fix this, I moved the definition of the counter
"changes_count" outside the while-loop and did not use the keyword "static".

Attach the new patch.

Regards,
Wang Wei

Вложения

Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Tue, Jan 31, 2023 at 2:53 PM wangw.fnst@fujitsu.com
<wangw.fnst@fujitsu.com> wrote:
>
> On Mon, Jan 30, 2023 at 17:50 PM I wrote:
> > Attach the new patch.
>
> When invoking the function ReorderBufferProcessTXN, the threshold-related
> counter "changes_count" may have some random value from the previous
> transaction's processing. To fix this, I moved the definition of the counter
> "changes_count" outside the while-loop and did not use the keyword "static".
>
> Attach the new patch.
>

Thanks, the patch looks good to me. I have slightly adjusted one of
the comments and ran pgindent. See attached. As mentioned in the
commit message, we shouldn't backpatch this as this requires a new
callback and moreover, users can increase the wal_sender_timeout and
wal_receiver_timeout to avoid this problem. What do you think?

-- 
With Regards,
Amit Kapila.

Вложения

Re: Logical replication timeout problem

От
Ashutosh Bapat
Дата:
On Tue, Jan 31, 2023 at 4:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> Thanks, the patch looks good to me. I have slightly adjusted one of
> the comments and ran pgindent. See attached. As mentioned in the
> commit message, we shouldn't backpatch this as this requires a new
> callback and moreover, users can increase the wal_sender_timeout and
> wal_receiver_timeout to avoid this problem. What do you think?

The callback and the implementation is all in core. What's the risk
you see in backpatching it?

Customers can adjust the timeouts, but only after the receiver has
timed out a few times. Replication remains broken till they notice it
and adjust timeouts. By that time WAL has piled up. It also takes a
few attempts to increase timeouts since the time taken by a
transaction to decode cannot be estimated beforehand. All that makes
it worth back-patching if it's possible. We had a customer who piled
up GBs of WAL before realising that this is the problem. Their system
almost came to a halt due to that.

--
Best Wishes,
Ashutosh Bapat



Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Tue, Jan 31, 2023 at 5:03 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Tue, Jan 31, 2023 at 4:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Thanks, the patch looks good to me. I have slightly adjusted one of
> > the comments and ran pgindent. See attached. As mentioned in the
> > commit message, we shouldn't backpatch this as this requires a new
> > callback and moreover, users can increase the wal_sender_timeout and
> > wal_receiver_timeout to avoid this problem. What do you think?
>
> The callback and the implementation is all in core. What's the risk
> you see in backpatching it?
>

Because we are changing the exposed structure, which can break
existing extensions using it.

> Customers can adjust the timeouts, but only after the receiver has
> timed out a few times. Replication remains broekn till they notice it
> and adjust timeouts. By that time WAL has piled up. It also takes a
> few attempts to increase timeouts since the time taken by a
> transaction to decode can not be estimated beforehand. All that makes
> it worth back-patching if it's possible. We had a customer who piled
> up GBs of WAL before realising that this is the problem. Their system
> almost came to a halt due to that.
>

Which version are they using? If they are at >=14, using "streaming =
on" for a subscription should also avoid this problem.

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
Ashutosh Bapat
Дата:
On Tue, Jan 31, 2023 at 5:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 31, 2023 at 5:03 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > On Tue, Jan 31, 2023 at 4:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > Thanks, the patch looks good to me. I have slightly adjusted one of
> > > the comments and ran pgindent. See attached. As mentioned in the
> > > commit message, we shouldn't backpatch this as this requires a new
> > > callback and moreover, users can increase the wal_sender_timeout and
> > > wal_receiver_timeout to avoid this problem. What do you think?
> >
> > The callback and the implementation is all in core. What's the risk
> > you see in backpatching it?
> >
>
> Because we are changing the exposed structure and which can break
> existing extensions using it.

Is that because we are adding the new member in the middle of the
structure? Shouldn't extensions provide new libraries with each
maintenance release of PG?

>
> > Customers can adjust the timeouts, but only after the receiver has
> > timed out a few times. Replication remains broekn till they notice it
> > and adjust timeouts. By that time WAL has piled up. It also takes a
> > few attempts to increase timeouts since the time taken by a
> > transaction to decode can not be estimated beforehand. All that makes
> > it worth back-patching if it's possible. We had a customer who piled
> > up GBs of WAL before realising that this is the problem. Their system
> > almost came to a halt due to that.
> >
>
> Which version are they using? If they are at >=14, using "streaming =
> on" for a subscription should also avoid this problem.

13.

-- 
Best Wishes,
Ashutosh Bapat



Re: Logical replication timeout problem

От
Peter Smith
Дата:
Here are my review comments for v13-00001.

======
Commit message

1.
The DDLs like Refresh Materialized views that generate lots of temporary
data due to rewrite rules may not be processed by output plugins (for
example pgoutput). So, we won't send keep-alive messages for a long time
while processing such commands and that can lead the subscriber side to
timeout.

~

SUGGESTION (minor rearranged way to say the same thing)

For DDLs that generate lots of temporary data due to rewrite rules
(e.g. REFRESH MATERIALIZED VIEW) the output plugins (e.g. pgoutput)
may not be processed for a long time. Since we don't send keep-alive
messages while processing such commands that can lead the subscriber
side to timeout.

~~~

2.
The commit message says what the problem is, but it doesn’t seem to
describe what this patch does to fix the problem.

======
src/backend/replication/logical/reorderbuffer.c

3.
+ /*
+ * It is possible that the data is not sent to downstream for a
+ * long time either because the output plugin filtered it or there
+ * is a DDL that generates a lot of data that is not processed by
+ * the plugin. So, in such cases, the downstream can timeout. To
+ * avoid that we try to send a keepalive message if required.
+ * Trying to send a keepalive message after every change has some
+ * overhead, but testing showed there is no noticeable overhead if
+ * we do it after every ~100 changes.
+ */


3a.
"data is not sent to downstream" --> "data is not sent downstream" (?)

~

3b.
"So, in such cases," --> "In such cases,"

~~~

4.
+#define CHANGES_THRESHOLD 100
+
+ if (++changes_count >= CHANGES_THRESHOLD)
+ {
+ rb->update_progress_txn(rb, txn, change->lsn);
+ changes_count = 0;
+ }

I was wondering if it would have been simpler to write this code like below.

Also, by doing it this way the 'changes_count' variable name makes
more sense IMO, otherwise (for current code) maybe it should be called
something like 'changes_since_last_keepalive'

SUGGESTION
if (++changes_count % CHANGES_THRESHOLD == 0)
    rb->update_progress_txn(rb, txn, change->lsn);


------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Wed, Feb 1, 2023 at 4:43 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Here are my review comments for v13-00001.
>
> ======
> Commit message
>
> 1.
> The DDLs like Refresh Materialized views that generate lots of temporary
> data due to rewrite rules may not be processed by output plugins (for
> example pgoutput). So, we won't send keep-alive messages for a long time
> while processing such commands and that can lead the subscriber side to
> timeout.
>
> ~
>
> SUGGESTION (minor rearranged way to say the same thing)
>
> For DDLs that generate lots of temporary data due to rewrite rules
> (e.g. REFRESH MATERIALIZED VIEW) the output plugins (e.g. pgoutput)
> may not be processed for a long time. Since we don't send keep-alive
> messages while processing such commands that can lead the subscriber
> side to timeout.
>

Hmm, this makes it less clear and in fact changed the meaning.

> ~~~
>
> 2.
> The commit message says what the problem is, but it doesn’t seem to
> describe what this patch does to fix the problem.
>

I thought it was apparent and the code comments made it clear.

>
> 4.
> +#define CHANGES_THRESHOLD 100
> +
> + if (++changes_count >= CHANGES_THRESHOLD)
> + {
> + rb->update_progress_txn(rb, txn, change->lsn);
> + changes_count = 0;
> + }
>
> I was wondering if it would have been simpler to write this code like below.
>
> Also, by doing it this way the 'changes_count' variable name makes
> more sense IMO, otherwise (for current code) maybe it should be called
> something like 'changes_since_last_keepalive'
>
> SUGGESTION
> if (++changes_count % CHANGES_THRESHOLD == 0)
>     rb->update_progress_txn(rb, txn, change->lsn);
>

I find the current code in the patch clear and easy to understand.

--
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Tue, Jan 31, 2023 at 8:24 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Tue, Jan 31, 2023 at 5:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jan 31, 2023 at 5:03 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> > >
> > > On Tue, Jan 31, 2023 at 4:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > Thanks, the patch looks good to me. I have slightly adjusted one of
> > > > the comments and ran pgindent. See attached. As mentioned in the
> > > > commit message, we shouldn't backpatch this as this requires a new
> > > > callback and moreover, users can increase the wal_sender_timeout and
> > > > wal_receiver_timeout to avoid this problem. What do you think?
> > >
> > > The callback and the implementation is all in core. What's the risk
> > > you see in backpatching it?
> > >
> >
> > Because we are changing the exposed structure and which can break
> > existing extensions using it.
>
> Is that because we are adding the new member in the middle of the
> structure?
>

Not only that but this changes the size of the structure and we want
to avoid that as well in stable branches. See email [1] (you can't
change the struct size either ...). As per my understanding, our usual
practice is to not change the exposed structure's size/definition in
stable branches.


[1] - https://www.postgresql.org/message-id/2358496.1649168259%40sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Wed, Feb 1, 2023 at 10:04 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 31, 2023 at 8:24 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > On Tue, Jan 31, 2023 at 5:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Jan 31, 2023 at 5:03 PM Ashutosh Bapat
> > > <ashutosh.bapat.oss@gmail.com> wrote:
> > > >
> > > > On Tue, Jan 31, 2023 at 4:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > > Thanks, the patch looks good to me. I have slightly adjusted one of
> > > > > the comments and ran pgindent. See attached. As mentioned in the
> > > > > commit message, we shouldn't backpatch this as this requires a new
> > > > > callback and moreover, users can increase the wal_sender_timeout and
> > > > > wal_receiver_timeout to avoid this problem. What do you think?
> > > >
> > > > The callback and the implementation is all in core. What's the risk
> > > > you see in backpatching it?
> > > >
> > >
> > > Because we are changing the exposed structure and which can break
> > > existing extensions using it.
> >
> > Is that because we are adding the new member in the middle of the
> > structure?
> >
>
> Not only that but this changes the size of the structure and we want
> to avoid that as well in stable branches. See email [1] (you can't
> change the struct size either ...). As per my understanding, our usual
> practice is to not change the exposed structure's size/definition in
> stable branches.
>
>

I am planning to push this to HEAD sometime next week (by Wednesday).
To backpatch this, we need to fix it in some non-standard way, like
without introducing a callback which I am not sure is a good idea. If
some other committers vote to get this in back branches with that or
some different idea that can be backpatched then we can do that
separately as well. I don't see this as a must-fix in back branches
because we have a workaround (increase timeout) or users can use the
streaming option (for >=14).

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
Andres Freund
Дата:
Hi,

On 2023-02-03 10:13:54 +0530, Amit Kapila wrote:
> I am planning to push this to HEAD sometime next week (by Wednesday).
> To backpatch this, we need to fix it in some non-standard way, like
> without introducing a callback which I am not sure is a good idea. If
> some other committers vote to get this in back branches with that or
> some different idea that can be backpatched then we can do that
> separately as well. I don't see this as a must-fix in back branches
> because we have a workaround (increase timeout) or users can use the
> streaming option (for >=14).

I just saw the commit go in, and a quick scan over it makes me think neither
this commit, nor f95d53eded, which unfortunately was already backpatched, is
the right direction. The wrong direction likely started quite a bit earlier,
with 024711bb544.

It feels quite fundamentally wrong that basically every output plugin needs to
call a special function in nearly every callback.

In 024711bb544 there was just one call to OutputPluginUpdateProgress() in
pgoutput.c. Quite tellingly, it just updated pgoutput, without touching
test_decoding.

Then a8fd13cab0b added two more calls. 63cf61cdeb7 yet another.


This makes no sense.  There's lots of output plugins out there. There's an
increasing number of callbacks.  This isn't a maintainable path forward.


If we want to call something to maintain state, it has to be happening from
central infrastructure.


It feels quite odd architecturally that WalSndUpdateProgress() ends up
flushing out writes - that's far far from obvious.

I don't think:
/*
 * Wait until there is no pending write. Also process replies from the other
 * side and check timeouts during that.
 */
static void
ProcessPendingWrites(void)

Is really a good name. What are we processing? What are we actually waiting
for - because we don't actually wait for the data to be sent out or anything,
just for it to be in a network buffer.


Greetings,

Andres Freund



Re: Logical replication timeout problem

От
Amit Kapila
Дата:
On Wed, Feb 8, 2023 at 10:57 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-02-03 10:13:54 +0530, Amit Kapila wrote:
> > I am planning to push this to HEAD sometime next week (by Wednesday).
> > To backpatch this, we need to fix it in some non-standard way, like
> > without introducing a callback which I am not sure is a good idea. If
> > some other committers vote to get this in back branches with that or
> > some different idea that can be backpatched then we can do that
> > separately as well. I don't see this as a must-fix in back branches
> > because we have a workaround (increase timeout) or users can use the
> > streaming option (for >=14).
>
> I just saw the commit go in, and a quick scan over it makes me think neither
> this commit, nor f95d53eded, which unfortunately was already backpatched, is
> the right direction. The wrong direction likely started quite a bit earlier,
> with 024711bb544.
>
> It feels quite fundamentally wrong that bascially every output plugin needs to
> call a special function in nearly every callback.
>
> In 024711bb544 there was just one call to OutputPluginUpdateProgress() in
> pgoutput.c. Quite tellingly, it just updated pgoutput, without touching
> test_decoding.
>
> Then a8fd13cab0b added to more calls. 63cf61cdeb7 yet another.
>

I think the original commit 024711bb544 forgot to call it in
test_decoding, and the other commits followed the same pattern and also
missed updating test_decoding.

>
> This makes no sense.  There's lots of output plugins out there. There's an
> increasing number of callbacks.  This isn't a maintainable path forward.
>
>
> If we want to call something to maintain state, it has to be happening from
> central infrastructure.
>
>
> It feels quite odd architecturally that WalSndUpdateProgress() ends up
> flushing out writes - that's far far from obvious.
>
> I don't think:
> /*
>  * Wait until there is no pending write. Also process replies from the other
>  * side and check timeouts during that.
>  */
> static void
> ProcessPendingWrites(void)
>
> Is really a good name. What are we processing?
>

It is for sending the keep_alive message (if required). That is
normally done when we have skipped processing a transaction, to ensure
sync replication is not delayed. It has been discussed previously [1][2] to
extend the WalSndUpdateProgress() interface. Basically, as explained
by Craig [2], this has to be done from the plugin as it can do filtering,
or there could be other reasons why the output plugin skips all
changes. We used the same interface for sending a keep-alive message
when we processed a lot of (DDL) changes without sending anything to
the plugin.

[1] - https://www.postgresql.org/message-id/20200309183018.tzkzwu635sd366ej%40alap3.anarazel.de
[2] - https://www.postgresql.org/message-id/CAMsr%2BYE3o8Dt890Q8wTooY2MpN0JvdHqUAHYL-LNhBryXOPaKg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
Andres Freund
Дата:
Hi,

On 2023-02-08 13:36:02 +0530, Amit Kapila wrote:
> On Wed, Feb 8, 2023 at 10:57 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2023-02-03 10:13:54 +0530, Amit Kapila wrote:
> > > I am planning to push this to HEAD sometime next week (by Wednesday).
> > > To backpatch this, we need to fix it in some non-standard way, like
> > > without introducing a callback which I am not sure is a good idea. If
> > > some other committers vote to get this in back branches with that or
> > > some different idea that can be backpatched then we can do that
> > > separately as well. I don't see this as a must-fix in back branches
> > > because we have a workaround (increase timeout) or users can use the
> > > streaming option (for >=14).
> >
> > I just saw the commit go in, and a quick scan over it makes me think neither
> > this commit, nor f95d53eded, which unfortunately was already backpatched, is
> > the right direction. The wrong direction likely started quite a bit earlier,
> > with 024711bb544.
> >
> > It feels quite fundamentally wrong that bascially every output plugin needs to
> > call a special function in nearly every callback.
> >
> > In 024711bb544 there was just one call to OutputPluginUpdateProgress() in
> > pgoutput.c. Quite tellingly, it just updated pgoutput, without touching
> > test_decoding.
> >
> > Then a8fd13cab0b added to more calls. 63cf61cdeb7 yet another.
> >
> 
> I think the original commit 024711bb544 forgets to call it in
> test_decoding and the other commits followed the same and missed to
> update test_decoding.

I think that's a symptom of the wrong architecture having been chosen. This
should *never* have been the task of output plugins.


> > I don't think:
> > /*
> >  * Wait until there is no pending write. Also process replies from the other
> >  * side and check timeouts during that.
> >  */
> > static void
> > ProcessPendingWrites(void)
> >
> > Is really a good name. What are we processing?
> >
> 
> It is for sending the keep_alive message (if required). That is
> normally done when we skipped processing a transaction to ensure sync
> replication is not delayed.

But how is that "processing pending writes"? For me "processing" implies we're
doing some analysis on them or such.


If we want to write data in WalSndUpdateProgress(), shouldn't we move the
common code of WalSndWriteData() and WalSndUpdateProgress() into
ProcessPendingWrites()?


> It has been discussed previously [1][2] to
> extend the WalSndUpdateProgress() interface. Basically, as explained
> by Craig [2], this has to be done from plugin as it can do filtering
> or there could be other reasons why the output plugin skips all
> changes. We used the same interface for sending keep-alive message
> when we processed a lot of (DDL) changes without sending anything to
> plugin.
>
> [1] - https://www.postgresql.org/message-id/20200309183018.tzkzwu635sd366ej%40alap3.anarazel.de
> [2] - https://www.postgresql.org/message-id/CAMsr%2BYE3o8Dt890Q8wTooY2MpN0JvdHqUAHYL-LNhBryXOPaKg%40mail.gmail.com

I don't buy that this has to be done by the output plugin. The actual sending
out of data happens via the LogicalDecodingContext callbacks, so we very well
can know whether we recently did send out data or not.

This really is a concern of the LogicalDecodingContext, it has pretty much
nothing to do with output plugins.  We should remove all calls of
OutputPluginUpdateProgress() from pgoutput, and add the necessary calls to
LogicalDecodingContext->update_progress() to generic code. And

Additionally we should either rename WalSndUpdateProgress(), because it's now
doing *far* more than "updating progress", or alternatively, split it into two
functions.
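
A deliberately simplified, hypothetical sketch of that direction (invented
names and types, not the real LogicalDecodingContext API):

#include <stdio.h>

typedef struct DecodingCtxSketch DecodingCtxSketch;

struct DecodingCtxSketch
{
    void    (*plugin_change_cb) (DecodingCtxSketch *ctx, int change);
    void    (*update_progress) (DecodingCtxSketch *ctx);
};

/* generic wrapper owned by the decoding infrastructure, not by a plugin */
static void
change_cb_wrapper_sketch(DecodingCtxSketch *ctx, int change)
{
    ctx->plugin_change_cb(ctx, change); /* the plugin may send data or filter it out */

    /*
     * The infrastructure, not the plugin, decides whether progress /
     * keepalive handling is due; a real implementation would rate-limit
     * this (by change count or elapsed time).
     */
    if (ctx->update_progress)
        ctx->update_progress(ctx);
}

static void
noop_plugin_change(DecodingCtxSketch *ctx, int change)
{
    (void) ctx;
    (void) change;      /* e.g. everything filtered out */
}

static void
report_progress(DecodingCtxSketch *ctx)
{
    (void) ctx;
    printf("progress updated / keepalive considered\n");
}

int
main(void)
{
    DecodingCtxSketch ctx = {noop_plugin_change, report_progress};

    for (int i = 0; i < 3; i++)
        change_cb_wrapper_sketch(&ctx, i);  /* the plugin never calls a progress function itself */
    return 0;
}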


I don't think the syncrep logic in WalSndUpdateProgress really works as-is -
consider what happens if e.g. the origin filter filters out entire
transactions. We'll afaics never get to WalSndUpdateProgress(). In some cases
we'll be lucky because we'll return quickly to XLogSendLogical(), but not
reliably.


Greetings,

Andres Freund



Re: Logical replication timeout problem

От
Andres Freund
Дата:
Hi,

On 2023-02-08 10:18:41 -0800, Andres Freund wrote:
> I don't think the syncrep logic in WalSndUpdateProgress really works as-is -
> consider what happens if e.g. the origin filter filters out entire
> transactions. We'll afaics never get to WalSndUpdateProgress(). In some cases
> we'll be lucky because we'll return quickly to XLogSendLogical(), but not
> reliably.

Is it actually the right thing to check SyncRepRequested() in that logic? It's
quite common to set up syncrep so that individual users or transactions opt
into syncrep, but to leave the default disabled.

I don't really see an alternative to making this depend solely on
sync_standbys_defined.

Greetings,

Andres Freund



Re: Logical replication timeout problem

От
Andres Freund
Дата:
Hi,

On 2023-02-08 10:30:37 -0800, Andres Freund wrote:
> On 2023-02-08 10:18:41 -0800, Andres Freund wrote:
> > I don't think the syncrep logic in WalSndUpdateProgress really works as-is -
> > consider what happens if e.g. the origin filter filters out entire
> > transactions. We'll afaics never get to WalSndUpdateProgress(). In some cases
> > we'll be lucky because we'll return quickly to XLogSendLogical(), but not
> > reliably.
>
> Is it actually the right thing to check SyncRepRequested() in that logic? It's
> quite common to set up syncrep so that individual users or transactions opt
> into syncrep, but to leave the default disabled.
>
> I don't really see an alternative to making this depend solely on
> sync_standbys_defined.

Hacking on a rough prototype how I think this should rather look, I had a few
questions / remarks:

- We probably need to call UpdateProgress from a bunch of places in decode.c
  as well? Indicating that we're lagging by a lot, just because all
  transactions were in another database seems decidedly suboptimal.

- Why should lag tracking only be updated at commit-like points? That seems
  like it adds odd discontinuities?

- The mix of skipped_xact and ctx->end_xact in WalSndUpdateProgress() seems
  somewhat odd. They have very overlapping meanings IMO.

- there's no UpdateProgress calls in pgoutput_stream_abort(), but ISTM there
  should be? It's legit progress.

- That's from 6912acc04f0: I find LagTrackerRead(), LagTrackerWrite() quite
  confusing, naming-wise. IIUC "reading" is about receiving confirmation
  messages, "writing" about the time the record was generated.  ISTM that the
  current time is a quite poor approximation in XLogSendPhysical(), but pretty
  much meaningless in WalSndUpdateProgress()? Am I missing something?

- Aren't the wal_sender_timeout / 2 checks in WalSndUpdateProgress(),
  WalSndWriteData() missing wal_sender_timeout <= 0 checks?

- I don't really understand why f95d53edged55 added !end_xact to the if
  condition for ProcessPendingWrites(). Is the theory that we'll end up in an
  outer loop soon?


Attached is a current, quite rough, prototype. It addresses some of the points
raised, but far from all. There's also several XXXs/FIXMEs in it.  I changed
the file-ending to .txt to avoid hijacking the CF entry.

Greetings,

Andres Freund

Вложения

Rework LogicalOutputPluginWriterUpdateProgress (WAS Re: Logical replication timeout ...)

От
Amit Kapila
Дата:
On Thu, Feb 9, 2023 at 1:33 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hacking on a rough prototype how I think this should rather look, I had a few
> questions / remarks:
>
> - We probably need to call UpdateProgress from a bunch of places in decode.c
>   as well? Indicating that we're lagging by a lot, just because all
>   transactions were in another database seems decidedly suboptimal.
>

We can do that, but I think in all those cases we will get back to the
walsender logic (WalSndLoop, which will send a keepalive if required)
quickly enough that we don't need to worry. After processing each
record, the logic returns to the main loop, which will send a
keepalive if required. Also, if we need to block while reading WAL, it
will call WalSndWaitForWal(), which will send a keepalive if required.
The real problem we have seen in field reports and tests arises when
we process a large transaction whose changes are queued in the
reorderbuffer and, while processing those, we discard all or most of the
changes.

The patch calls update_progress in change_cb_wrapper and other
wrappers, which will miss the case of DDLs that generate a lot of data
that is not processed by the plugin. I think for that we either need
to call update_progress from reorderbuffer.c, similar to what the patch
has removed, or we need some other way to address it. Do you have any
better idea?

> - Why should lag tracking only be updated at commit like points? That seems
>   like it adds odd discontinuinities?
>

We have previously experimented with calling it from non-commit locations,
but that turned out to give inaccurate information about lag. See
email [1].

> - The mix of skipped_xact and ctx->end_xact in WalSndUpdateProgress() seems
>   somewhat odd. They have very overlapping meanings IMO.
>
> - there's no UpdateProgress calls in pgoutput_stream_abort(), but ISTM there
>   should be? It's legit progress.
>

Agreed with both of the above points.

> - That's from 6912acc04f0: I find LagTrackerRead(), LagTrackerWrite() quite
>   confusing, naming-wise. IIUC "reading" is about receiving confirmation
>   messages, "writing" about the time the record was generated.  ISTM that the
>   current time is a quite poor approximation in XLogSendPhysical(), but pretty
>   much meaningless in WalSndUpdateProgress()? Am I missing something?
>

Leaving it for Thomas to answer.

> - Aren't the wal_sender_timeout / 2 checks in WalSndUpdateProgress(),
>   WalSndWriteData() missing wal_sender_timeout <= 0 checks?
>

It seems we are checking that via
ProcessPendingWrites()->WalSndKeepaliveIfNecessary(). Do you think we
need to check it before as well?

> - I don't really understand why f95d53edged55 added !end_xact to the if
>   condition for ProcessPendingWrites(). Is the theory that we'll end up in an
>   outer loop soon?
>

Yes. For non-empty xacts, we will anyway send a commit message. For
empty (skipped) xacts, we will send it for the synchronous replication case
to avoid any delay.

>
> Attached is a current, quite rough, prototype. It addresses some of the points
> raised, but far from all. There's also several XXXs/FIXMEs in it.  I changed
> the file-ending to .txt to avoid hijacking the CF entry.
>

I have started a separate thread to avoid such confusion. I hope that
is fine with you.

> > > I don't think the syncrep logic in WalSndUpdateProgress really works as-is -
> > > consider what happens if e.g. the origin filter filters out entire
> > > transactions. We'll afaics never get to WalSndUpdateProgress(). In some cases
> > > we'll be lucky because we'll return quickly to XLogSendLogical(), but not
> > > reliably.
> >

Which case are you worried about? As mentioned in one of the previous
points I thought the timeout/keepalive handling in the callers should
be enough.

> > Is it actually the right thing to check SyncRepRequested() in that logic? It's
> > quite common to set up syncrep so that individual users or transactions opt
> > into syncrep, but to leave the default disabled.
> >
> > I don't really see an alternative to making this depend solely on
> > sync_standbys_defined.

Fair point.

How about renaming ProcessPendingWrites to WaitToSendPendingWrites or
WalSndWaitToSendPendingWrites?

[1] -
https://www.postgresql.org/message-id/OS3PR01MB62755D216245199554DDC8DB9EEA9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.



Re: Rework LogicalOutputPluginWriterUpdateProgress (WAS Re: Logical replication timeout ...)

От
Amit Kapila
Дата:
On Thu, Feb 9, 2023 at 11:21 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>
> How about renaming ProcessPendingWrites to WaitToSendPendingWrites or
> WalSndWaitToSendPendingWrites?
>

How about renaming WalSndUpdateProgress() to
WalSndUpdateProgressAndSendKeepAlive() or
WalSndUpdateProgressAndKeepAlive()?

One thing to note about the changes we are discussing here is that
some of the plugins like wal2json already call
OutputPluginUpdateProgress in their commit callback. They may need to
update it accordingly.

One difference I see with the patch is that I think we will end up
sending keepalive for empty prepared transactions even though we don't
skip sending begin/prepare messages for those. The reason why we don't
skip sending prepare for empty 2PC xacts is that if the WALSender
restarts after the PREPARE of a transaction and before the COMMIT
PREPARED of the same transaction then we won't be able to figure out
if we have skipped sending BEGIN/PREPARE of a transaction. To skip
sending prepare for empty xacts, we previously thought of some ideas
like (a) At commit-prepare time have a check on the subscriber-side to
know whether there is a corresponding prepare for it before actually
doing commit-prepare but that sounded costly. (b) somehow persist the
information whether the PREPARE for a xact is already sent and then
use that information for commit prepared but again that also didn't
sound like a good idea.

-- 
With Regards,
Amit Kapila.



Re: Logical replication timeout problem

От
"Gregory Stark (as CFM)"
Дата:
On Wed, 8 Feb 2023 at 15:04, Andres Freund <andres@anarazel.de> wrote:
>
> Attached is a current, quite rough, prototype. It addresses some of the points
> raised, but far from all. There's also several XXXs/FIXMEs in it.  I changed
> the file-ending to .txt to avoid hijacking the CF entry.

It looks like this patch has received quite a generous helping of
feedback from Andres. I'm setting it to Waiting on Author.

On the one hand it looks like there's a lot of work to do on this but
on the other hand it sounds like this is a live problem in the field
so if it can get done in time for release that would be great but if
not then feel free to move it to the next commitfest (which means next
release).


-- 
Gregory Stark
As Commitfest Manager



RE: Logical replication timeout problem

От
"wangw.fnst@fujitsu.com"
Дата:
On Thu, Mar 2, 2023 at 4:19 AM Gregory Stark (as CFM) <stark.cfm@gmail.com> wrote:    
> On Wed, 8 Feb 2023 at 15:04, Andres Freund <andres@anarazel.de> wrote:
> >
> > Attached is a current, quite rough, prototype. It addresses some of the points
> > raised, but far from all. There's also several XXXs/FIXMEs in it.  I changed
> > the file-ending to .txt to avoid hijacking the CF entry.
> 
> It looks like this patch has received quite a generous helping of
> feedback from Andres. I'm setting it to Waiting on Author.
> 
> On the one hand it looks like there's a lot of work to do on this but
> on the other hand it sounds like this is a live problem in the field
> so if it can get done in time for release that would be great but if
> not then feel free to move it to the next commitfest (which means next
> release).

Hi,

Since this patch is an improvement to the architecture in HEAD, we started
another new thread [1] on this topic to develop the related patch.

It seems that we could modify the details of this CF entry to point to the new
thread and change the status to 'Needs Review'.

[1] - https://www.postgresql.org/message-id/20230210210423.r26ndnfmuifie4f6%40awork3.anarazel.de

Regards,
Wang Wei