Discussion: Logical replication timeout problem
On Fri, Sep 17, 2021 at 3:29 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> Hi,
>
> Logical replication is configured on one instance in version 10.18. Timeout errors occur regularly and the worker process exits with exit code 1
>
> 2021-09-16 12:06:50 CEST [24881]: [1-1] user=postgres,db=foo,client=[local] LOG: duration: 1281408.171 ms statement: COPY schem.tab (col1, col2) FROM stdin;
> 2021-09-16 12:07:11 CEST [12161]: [1-1] user=,db=,client= LOG: automatic analyze of table "foo.schem.tab" system usage: CPU: user: 4.13 s, system: 0.55 s, elapsed: 9.58 s
> 2021-09-16 12:07:50 CEST [3770]: [2-1] user=,db=,client= ERROR: terminating logical replication worker due to timeout
> 2021-09-16 12:07:50 CEST [12546]: [11-1] user=,db=,client= LOG: worker process: logical replication worker for subscription 24106654 (PID 3770) exited with exit code 1
> 2021-09-16 12:07:50 CEST [13872]: [1-1] user=,db=,client= LOG: logical replication apply worker for subscription "subxxxx" has started
> 2021-09-16 12:07:50 CEST [13873]: [1-1] user=repuser,db=foo,client=127.0.0.1 LOG: received replication command: IDENTIFY_SYSTEM
>
Can you share the publisher-side log as well?
--
With Regards,
Amit Kapila.
On Fri, Sep 17, 2021 at 8:08 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> the publisher and the subscriber run on the same postgres instance.
>
Okay, but there is no log corresponding to operations being performed
by the publisher. By looking at current logs it is not very clear to
me what might have caused this. Did you try increasing
wal_sender_timeout and wal_receiver_timeout?
--
With Regards,
Amit Kapila.
On Mon, Sep 20, 2021 at 5:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> The basic problem here seems to be that WAL Sender is not able to send
> a keepalive or any other message for the configured
> wal_receiver_timeout. I am not sure how that can happen but can you
> once try by switching autovacuum = off? I wanted to ensure that
> WALSender is not blocked due to the background process autovacuum.
>
The other thing we can try out is to check the data in pg_locks on the
publisher during the one minute after the large copy is finished. This
we can try out both with and without autovacuum.
--
With Regards,
Amit Kapila.
On Mon, Sep 20, 2021 at 4:10 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> Hi Amit,
>
> We can replay the problem: we load a table of several Gb in the schema of the publisher, this generates the worker's timeout after one minute from the end of this load. The table on which this load is executed is not replicated.
>
> 2021-09-16 12:06:50 CEST [24881]: [1-1] user=postgres,db=db012a00,client=[local] LOG: duration: 1281408.171 ms statement: COPY db.table (col1, col2) FROM stdin;
>
> 2021-09-16 12:07:11 CEST [12161]: [1-1] user=,db=,client= LOG: automatic analyze of table "db.table " system usage: CPU: user: 4.13 s, system: 0.55 s, elapsed: 9.58 s
>
> 2021-09-16 12:07:50 CEST [3770]: [2-1] user=,db=,client= ERROR: terminating logical replication worker due to timeout
>
> Before increasing value for wal_sender_timeout and wal_receiver_timeout I thought to further investigate the mechanisms leading to this timeout.
>
The basic problem here seems to be that WAL Sender is not able to send
a keepalive or any other message for the configured
wal_receiver_timeout. I am not sure how that can happen but can you
once try by switching autovacuum = off? I wanted to ensure that
WALSender is not blocked due to the background process autovacuum.
--
With Regards,
Amit Kapila.
On Mon, Sep 20, 2021 at 9:43 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> By passing the autovacuum parameter to off the problem did not occur right after loading the table as in our previous tests. However, the timeout occurred later. We have seen the accumulation of .snap files for several Gb.
>
> ...
> -rw-------. 1 postgres postgres 16791226 Sep 20 15:26 xid-1238444701-lsn-2D2B-F5000000.snap
> -rw-------. 1 postgres postgres 16973268 Sep 20 15:26 xid-1238444701-lsn-2D2B-F6000000.snap
> -rw-------. 1 postgres postgres 16790984 Sep 20 15:26 xid-1238444701-lsn-2D2B-F7000000.snap
> -rw-------. 1 postgres postgres 16988112 Sep 20 15:26 xid-1238444701-lsn-2D2B-F8000000.snap
> -rw-------. 1 postgres postgres 16864593 Sep 20 15:26 xid-1238444701-lsn-2D2B-F9000000.snap
> -rw-------. 1 postgres postgres 16902167 Sep 20 15:26 xid-1238444701-lsn-2D2B-FA000000.snap
> -rw-------. 1 postgres postgres 16914638 Sep 20 15:26 xid-1238444701-lsn-2D2B-FB000000.snap
> -rw-------. 1 postgres postgres 16782471 Sep 20 15:26 xid-1238444701-lsn-2D2B-FC000000.snap
> -rw-------. 1 postgres postgres 16963667 Sep 20 15:27 xid-1238444701-lsn-2D2B-FD000000.snap
> ...
>
Okay, still not sure why the publisher is not sending keep_alive
messages in between spilling such a big transaction. If you see, we
have logic in WalSndLoop() wherein each time after sending data we
check whether we need to send a keep-alive message via function
WalSndKeepaliveIfNecessary(). I think to debug this problem further
you need to add some logs in function WalSndKeepaliveIfNecessary() to
see why it is not sending keep_alive messages when all these files are
being created.
Did you change the default value of
wal_sender_timeout/wal_receiver_timeout? What is the value of those
variables in your environment? Did you see the message "terminating
walsender process due to replication timeout" in your server logs?
--
With Regards,
Amit Kapila.
On Tue, Sep 21, 2021 at 1:52 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> If I understand, the instruction to send keep alive by the wal sender has not been reached in the for loop, for what reason?
> ...
> /* Check for replication timeout. */
> WalSndCheckTimeOut();
>
> /* Send keepalive if the time has come */
> WalSndKeepaliveIfNecessary();
> ...
>
Are you sure that these functions have not been called? Or the case is
that these are called but due to some reason the keep-alive is not
sent? IIUC, these are called after processing each WAL record so not
sure how is it possible in your case that these are not reached?
> The data load is performed on a table which is not replicated, I do not understand why the whole transaction linked to an insert is copied to snap files given that table does not take part of the logical replication.
>
It is because we don't know till the end of the transaction (where we
start sending the data) whether the table will be replicated or not. I
think specifically for this purpose the new 'streaming' feature
introduced in PG-14 will help us to avoid writing data of such tables
to snap/spill files. See 'streaming' option in Create Subscription
docs [1].
> We are going to do a test by modifying parameters wal_sender_timeout/wal_receiver_timeout from 1' to 5'. The problem is that these parameters are global and changing them will also impact the physical replication.
>
Do you mean you are planning to change from 1 minute to 5 minutes? I
agree with the global nature of parameters and I think your approach
to finding out the root cause is good here because otherwise, under
some similar or more heavy workload, it might lead to the same
situation.
> Concerning the walsender timeout, when the worker is started again after a timeout, it will trigger a new walsender associated with it.
>
Right, I know that but I was curious to know if the walsender has
exited before walreceiver.
[1] - https://www.postgresql.org/docs/devel/sql-createsubscription.html
--
With Regards,
Amit Kapila.
On Tue, Sep 21, 2021 at 9:12 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> > IIUC, these are called after processing each WAL record so not
> sure how is it possible in your case that these are not reached?
>
> I don't know, as you say, to highlight the problem we would have to debug the WalSndKeepaliveIfNecessary function
>
> > I was curious to know if the walsender has exited before walreceiver
>
> During the last tests we made we didn't observe any timeout of the wal sender process.
>
> > Do you mean you are planning to change from 1 minute to 5 minutes?
>
> We set wal_sender_timeout/wal_receiver_timeout to 5' and launched a new test. The result is surprising and rather positive: there is no timeout any more in the log, and the 20 Gb of snap files are removed in less than 5 minutes.
> How can that behaviour be explained? Why are the snap files suddenly consumed so quickly?
>
I think it is because we decide that the data in those snap files
doesn't need to be sent at xact end, so we remove them.
> I chose the values arbitrarily for the wal_sender_timeout/wal_receiver_timeout parameters; are these values appropriate from your point of view?
>
It is difficult to say what is the appropriate value for these
parameters unless in some way we debug WalSndKeepaliveIfNecessary() to
find why it didn't send keep alive when it is expected. Would you be
able to make code changes and test or if you want I can make changes
and send the patch if you can test it? If not, is it possible that in
some way you send a reproducible test?
--
With Regards,
Amit Kapila.
Attachments
On Wed, Sep 22, 2021 at 9:46 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> If you would like I can test the patch you send to me.
>
Okay, please find an attached patch for additional logs. I would like
to see the logs during the time when walsender appears to be writing
to files. We might need to add more logs to find the exact problem but
let's start with this.
--
With Regards,
Amit Kapila.
On Friday, September 24, 2021 12:04 AM, Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> Thanks for your patch, we are going to set up a lab in order to debug the function.
Hi
I tried to reproduce this timeout problem on version 10.18 but failed.
In my trial, I inserted large amounts of data at publisher, which took more than 1 minute to replicate.
And with the patch provided by Amit, I saw that the frequency of invoking
WalSndKeepaliveIfNecessary function is raised after I inserted data.
The test script is attached. Maybe you can try it on your machine and check if this problem could happen.
If I miss something in the script, please let me know.
Of course, it will be better if you can provide your script to reproduce the problem.
Regards
Tang
Attachments
Thanks Tang for your script. Our debugging environment will be ready soon. I will test your script and we will try to reproduce the problem by integrating the patch provided by Amit. As soon as I have results I will let you know.
Regards
Fabrice

On Thu, Sep 30, 2021 at 3:15 AM Tang, Haiying/唐 海英 <tanghy.fnst@fujitsu.com> wrote:
> I tried to reproduce this timeout problem on version 10.18 but failed.
> The test script is attached. Maybe you can try it on your machine and check if this problem could happen.
On Friday, November 12, 2021 2:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I am also not sure why that happened may be due to
> max_worker_processes reaching its limit. This can happen because it
> seems you configured both publisher and subscriber in the same
> cluster. Tang, did you also see the same problem?
>
No. I used the default max_worker_processes value, ran logical replication and physical replication at the same time. I also changed the data in the table on the publisher. But I didn't see the same problem.
Regards
Tang
Yes, in the original environment there is physical replication; that's why for the lab I configured 2 nodes with physical replication.
On Thu, Nov 11, 2021 at 11:15 PM Fabrice Chapuis
<fabrice636861@gmail.com> wrote:
>
> Hello,
> Our lab is ready now. Amit, I compiled Postgres 10.18 with your patch. Tang, I used your script to configure logical replication between 2 databases and to generate 10 million entries in an unreplicated foo table. On a standalone instance no error message appears in the log.
> I activate the physical replication between 2 nodes, and I got following error:
>
> 2021-11-10 10:49:12.297 CET [12126] LOG: attempt to send keep alive message
> 2021-11-10 10:49:12.297 CET [12126] STATEMENT: START_REPLICATION 0/3000000 TIMELINE 1
> 2021-11-10 10:49:15.127 CET [12064] FATAL: terminating logical replication worker due to administrator command
> 2021-11-10 10:49:15.127 CET [12036] LOG: worker process: logical replication worker for subscription 16413 (PID 12064) exited with exit code 1
> 2021-11-10 10:49:15.155 CET [12126] LOG: attempt to send keep alive message
>
> This message looks strange because no admin command has been executed during data load.
> I did not find any error related to the timeout.
> The message coming from the modification made with the patch comes back all the time: attempt to send keep alive message. But there is no "sent keep alive message".
>
> Why logical replication worker exit when physical replication is configured?
>
I am also not sure why that happened may be due to
max_worker_processes reaching its limit. This can happen because it
seems you configured both publisher and subscriber in the same
cluster. Tang, did you also see the same problem?
BTW, why are you bringing the physical standby configuration into the
test? In the original setup where you observed the problem, were
physical standbys present?
--
With Regards,
Amit Kapila.
rc = WaitLatchOrSocket(MyLatch,
                       WL_SOCKET_READABLE | WL_LATCH_SET |
                       WL_TIMEOUT | WL_POSTMASTER_DEATH,
                       fd, wait_time,
                       WAIT_EVENT_LOGICAL_APPLY_MAIN);
On Wed, Dec 22, 2021 at 8:50 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> Hello Amit,
>
> I was able to reproduce the timeout problem in the lab.
> After loading more than 20 million rows into a table which is not replicated (the insert command ends without error), errors related to the logical replication processes appear in the postgres log.
> Approximately every 5 minutes the worker process is restarted. The snap files in the slot directory are still present. The replication system seems to be blocked. Why are these snap files not removed? What do they contain?
>
These contain changes of insert. I think these are not removed for
your case as your long transaction is never finished. As mentioned
earlier, for such cases, it is better to use 'streaming' feature
released as part of PG-14 but anyway here we are trying to debug
timeout problem.
> I will recompile postgres with your patch to debug.
>
Okay, that might help.
--
With Regards,
Amit Kapila.
I put the instance in high-level debug mode.
I try to do some log interpretation: after having finished writing the modifications generated by the insert to the snap files, these files are read back (restored). One minute after this work starts, the worker process exits with error code 1.
I see that keepalive messages were sent before the worker process left.

2021-12-28 10:50:01.894 CET [55792] LOCATION: WalSndKeepalive, walsender.c:3365
...
2021-12-28 10:50:31.854 CET [55792] STATEMENT: START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:31.907 CET [55792] DEBUG: 00000: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2021-12-28 10:50:31.907 CET [55792] LOCATION: ShowTransactionStateRec, xact.c:5075
2021-12-28 10:50:31.907 CET [55792] STATEMENT: START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:31.907 CET [55792] DEBUG: 00000: spill 2271 changes in XID 14312 to disk
2021-12-28 10:50:31.907 CET [55792] LOCATION: ReorderBufferSerializeTXN, reorderbuffer.c:2245
2021-12-28 10:50:31.907 CET [55792] STATEMENT: START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:32.110 CET [55792] DEBUG: 00000: restored 4096/22603999 changes from disk
2021-12-28 10:50:32.110 CET [55792] LOCATION: ReorderBufferIterTXNNext, reorderbuffer.c:1156
2021-12-28 10:50:32.110 CET [55792] STATEMENT: START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2021-12-28 10:50:32.138 CET [55792] DEBUG: 00000: restored 4096/22603999 changes from disk
...
2021-12-28 10:50:35.341 CET [55794] DEBUG: 00000: sending replication keepalive
2021-12-28 10:50:35.341 CET [55794] LOCATION: WalSndKeepalive, walsender.c:3365
...
2021-12-28 10:51:31.995 CET [55791] ERROR: XX000: terminating logical replication worker due to timeout
2021-12-28 10:51:31.995 CET [55791] LOCATION: LogicalRepApplyLoop, worker.c:1267
Could this function in the apply main loop in worker.c help to find a solution?

rc = WaitLatchOrSocket(MyLatch,
                       WL_SOCKET_READABLE | WL_LATCH_SET |
                       WL_TIMEOUT | WL_POSTMASTER_DEATH,
                       fd, wait_time,
                       WAIT_EVENT_LOGICAL_APPLY_MAIN);
In a first phase, snap files are generated and stored in pg_replslot. This process ends when 1420 files are present in pg_replslot (this is related to the statements that must be replayed from the WAL). In the pg_stat_replication view, the state field is set to catchup.
In a second phase, the snap files must be decoded. However, after one minute (wal_receiver_timeout is set to 1 minute) the worker process stops with a timeout.
Thank you for your help
On Wed, Dec 29, 2021 at 5:02 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> [...]
>
It is still not clear to me why the problem happened? IIUC, after restoring 4096 changes from the snap files, we send them to the subscriber, and then the apply worker should apply those one by one. Now, is it taking one minute to restore 4096 changes, due to which the apply worker is timed out?

> Could this function in Apply main loop in worker.c help to find a solution?
> rc = WaitLatchOrSocket(MyLatch, WL_SOCKET_READABLE | WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, fd, wait_time, WAIT_EVENT_LOGICAL_APPLY_MAIN);
>
Can you explain why you think this will help in solving your current problem?
--
With Regards,
Amit Kapila.
> Can you explain why you think this will help in solving your current problem?

Indeed you are right, this function won't help; we have to look elsewhere.

> It is still not clear to me why the problem happened? IIUC, after restoring 4096 changes from snap files, we send them to the subscriber, and then apply worker should apply those one by one. Now, is it taking one minute to restore 4096 changes due to which apply worker is timed out?

Now I can easily reproduce the problem.
In a first phase, snap files are generated and stored in pg_replslot. This process ends when 1420 files are present in pg_replslot (this is related to the statements that must be replayed from the WAL). In the pg_stat_replication view, the state field is set to catchup.
In a second phase, the snap files must be decoded. However, after one minute (wal_receiver_timeout is set to 1 minute) the worker process stops with a timeout.
On Tue, Jan 11, 2022 at 8:13 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> In a first phase, snap files are generated and stored in pg_replslot. This process ends when 1420 files are present in pg_replslot (this is related to the statements that must be replayed from the WAL). In the pg_stat_replication view, the state field is set to catchup.
> In a second phase, the snap files must be decoded. However, after one minute (wal_receiver_timeout is set to 1 minute) the worker process stops with a timeout.
>
What exactly do you mean by the first and second phase in the above description?
--
With Regards,
Amit Kapila.
If I can follow you, I have to make the following changes:

1. In walsender.c

static void
WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
{
static TimestampTz sendTime = 0;
TimestampTz now = GetCurrentTimestamp();
/* Keep the worker process alive */
WalSndKeepalive(true);
/*
 * Track lag no more than once per WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS to
 * avoid flooding the lag tracker when we commit frequently.
 */
#define WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS 1000
if (!TimestampDifferenceExceeds(sendTime, now,
WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS))
return;
LagTrackerWrite(lsn, now);
sendTime = now;
}
2. In pgoutput.c
/*
* Sends the decoded DML over wire.
*
* This is called both in streaming and non-streaming modes.
*/
static void
pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
MemoryContext old;
RelationSyncEntry *relentry;
TransactionId xid = InvalidTransactionId;
Relation ancestor = NULL;
WalSndUpdateProgress(ctx, txn->origin_lsn, change->txn->xid);
if (!is_publishable_relation(relation))
return;
2022-01-13 11:19:46.340 CET [82233] STATEMENT: START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2022-01-13 11:19:46.340 CET [82233] LOG: 00000: attempt to send keep alive message
2022-01-13 11:19:46.340 CET [82233] LOCATION: WalSndKeepaliveIfNecessary, walsender.c:3389
2022-01-13 11:19:46.340 CET [82233] STATEMENT: START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2022-01-13 11:19:46.340 CET [82233] LOG: 00000: attempt to send keep alive message
2022-01-13 11:19:46.340 CET [82233] LOCATION: WalSndKeepaliveIfNecessary, walsender.c:3389
2022-01-13 11:19:46.340 CET [82233] STATEMENT: START_REPLICATION SLOT "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names '"pub008_s012a00"')
2022-01-13 11:20:46.418 CET [82232] ERROR: XX000: terminating logical replication worker due to timeout
2022-01-13 11:20:46.418 CET [82232] LOCATION: LogicalRepApplyLoop, worker.c:1267
2022-01-13 11:20:46.421 CET [82224] LOG: 00000: worker process: logical replication worker for subscription 26994 (PID 82232) exited with exit code 1
2022-01-13 11:20:46.421 CET [82224] LOCATION: LogChildExit, postmaster.c:3625
On Thu, Jan 13, 2022 at 3:43 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> first phase: postgres read WAL files and generate 1420 snap files.
> second phase: I guess, but on this point maybe you can clarify, postgres has to decode the snap files and remove them if no statement must be applied on a replicated table.
> It is from this point that the worker process exit after 1 minute timeout.
>
Okay, I think the problem could be that because we are skipping all
the changes of transaction there is no communication sent to the
subscriber and it eventually timed out. Actually, we try to send
keep-alive at transaction boundaries like when we call
pgoutput_commit_txn. The pgoutput_commit_txn will call
OutputPluginWrite->WalSndWriteData. I think to tackle the problem we
need to try to send such keepalives via WalSndUpdateProgress and
invoke that in pgoutput_change when we skip sending the change.
--
With Regards,
Amit Kapila.
On Fri, Jan 14, 2022 at 3:47 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> If I can follow you, I have to make the following changes:
>
No, not like that, but we can try that way as well to see if it helps
to avoid your problem. Am I understanding correctly that even after the
modification, you are seeing the problem? Can you try calling
WalSndKeepaliveIfNecessary() instead of WalSndKeepalive()?
--
With Regards,
Amit Kapila.
Hello Amit,
If it is little work for you, can you please send me a piece of code
with the change needed to do the test?
Thanks
Regards,
Fabrice
On Wed, Jan 19, 2022 at 9:53 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
> Hello Amit,
> If it takes little work for you, can you please send me a piece of code
> with the change needed to do the test
I wrote a patch (Send-keepalive.patch, please refer to the attachment) according
to Amit's suggestions. But after running some simple tests of this patch with
the test script "test.sh" (please refer to the attachment), I found that the
timeout problem is not fixed by this patch.
So I added some logs (please refer to Add-some-logs-to-debug.patch) to confirm
whether the newly added WalSndKeepaliveIfNecessary() sends a keepalive message
or not.
After applying Send-keepalive.patch and Add-some-logs-to-debug.patch, I
found that the added message "send keep alive message" was not printed in the
publisher-side log.
[publisher-side log]:
2022-01-20 15:21:50.057 CST [2400278] LOG: checkpoint complete: wrote 61 buffers (0.4%); 0 WAL file(s) added, 0 removed, 0 recycled; write=9.838 s, sync=0.720 s, total=10.559 s; sync files=4, longest=0.563 s, average=0.180 s; distance=538053 kB, estimate=543889 kB
2022-01-20 15:21:50.977 CST [2400278] LOG: checkpoints are occurring too frequently (11 seconds apart)
2022-01-20 15:21:50.977 CST [2400278] HINT: Consider increasing the configuration parameter "max_wal_size".
2022-01-20 15:21:50.988 CST [2400278] LOG: checkpoint starting: wal
2022-01-20 15:21:52.853 CST [2400404] LOG: begin load changes
2022-01-20 15:21:52.853 CST [2400404] STATEMENT: START_REPLICATION SLOT "sub" LOGICAL 0/0 (proto_version '3', publication_names '"pub"')
2022-01-20 15:22:52.969 CST [2410649] ERROR: replication slot "sub" is active for PID 2400404
2022-01-20 15:22:52.969 CST [2410649] STATEMENT: START_REPLICATION SLOT "sub" LOGICAL 0/0 (proto_version '3', publication_names '"pub"')
2022-01-20 15:22:57.980 CST [2410657] ERROR: replication slot "sub" is active for PID 2400404
[subscriber-side log]:
2022-01-20 15:16:10.975 CST [2400335] LOG: checkpoint starting: time
2022-01-20 15:16:16.052 CST [2400335] LOG: checkpoint complete: wrote 51 buffers (0.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=4.830 s, sync=0.135 s, total=5.078 s; sync files=39, longest=0.079 s, average=0.004 s; distance=149 kB, estimate=149 kB
2022-01-20 15:22:52.738 CST [2400400] ERROR: terminating logical replication worker due to timeout
2022-01-20 15:22:52.738 CST [2400332] LOG: background worker "logical replication worker" (PID 2400400) exited with exit code 1
2022-01-20 15:22:52.740 CST [2410648] LOG: logical replication apply worker for subscription "sub" has started
2022-01-20 15:22:52.969 CST [2410648] ERROR: could not start WAL streaming: ERROR: replication slot "sub" is active for PID 2400404
2022-01-20 15:22:52.970 CST [2400332] LOG: background worker "logical replication worker" (PID 2410648) exited with exit code 1
2022-01-20 15:22:57.977 CST [2410656] LOG: logical replication apply worker for subscription "sub" has started
It seems WalSndKeepaliveIfNecessary did not send a keepalive message during
testing. I am still researching the cause.
I attach the patches and test script mentioned above, in case someone wants to
try them.
If I missed something, please let me know.
Regards,
Wang wei
Attachments
On Thu, Jan 20, 2022 at 9:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> It might be not reaching the actual send_keep_alive logic in
> WalSndKeepaliveIfNecessary because of below code:
> {
> ...
> /*
> * Don't send keepalive messages if timeouts are globally disabled or
> * we're doing something not partaking in timeouts.
> */
> if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0) return; ..
> }
>
> I think you can add elog before the above return and before updating progress
> in the below code:
> case REORDER_BUFFER_CHANGE_INSERT:
> if (!relentry->pubactions.pubinsert)
> + {
> + OutputPluginUpdateProgress(ctx);
> return;
>
> This will help us to rule out one possibility.
Thanks for your advice!
Following it, I applied 0001, 0002 and 0003 and ran the test script.
When the subscriber timed out, I filtered the publisher-side log:
$ grep "before invoking update progress" pub.log | wc -l
60373557
$ grep "return because wal_sender_timeout or last_reply_timestamp" pub.log | wc -l
0
$ grep "return because waiting_for_ping_response" pub.log | wc -l
0
Based on this result, I think the function WalSndKeepaliveIfNecessary was
invoked, but WalSndKeepalive was not invoked because (last_processing >=
ping_time) is false.
So I tried to track the changes to last_processing and last_reply_timestamp
(because ping_time is based on last_reply_timestamp).
I found that last_processing and last_reply_timestamp are set in the function
ProcessRepliesIfAny.
last_processing is set to the time when ProcessRepliesIfAny is invoked.
Only when the publisher receives a response from the subscriber is
last_reply_timestamp set to last_processing and the flag
waiting_for_ping_response reset to false.
While we are in the loop skipping all the changes of a transaction, IIUC, we do
not invoke ProcessRepliesIfAny. So I think last_processing and
last_reply_timestamp will not be changed in this loop.
Therefore, for our use case, I think we should modify the condition for
invoking WalSndKeepalive (please refer to
0004-Simple-modification-of-timing.patch, and note that this is only a patch
for testing).
At the same time, I changed the argument of WalSndKeepalive from true to false.
This is because when the argument is true, waiting_for_ping_response is set to
true in WalSndKeepalive. As mentioned above, ProcessRepliesIfAny is not invoked
in the loop, so I think waiting_for_ping_response would not be reset to false
and keepalive messages would not be sent.
After applying patches 0001 and 0004, I found that the timeout was no longer
printed in the subscriber-side log, and the added messages "begin load changes"
and "commit the log" were printed in the publisher-side log:
$ grep -ir "begin load changes" pub.log
2022-01-21 11:17:06.934 CST [2577699] LOG: begin load changes
$ grep -ir "commit the log" pub.log
2022-01-21 11:21:15.564 CST [2577699] LOG: commit the log
I attach the patches and test script mentioned above, in case someone wants to
try them.
Regards,
Wang wei
I keep your patch 0001 and I add these two calls in the function WalSndUpdateProgress, without modifying WalSndKeepaliveIfNecessary; it works too. What do you think of this patch?

static void
WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
{
static TimestampTz sendTime = 0;
TimestampTz now = GetCurrentTimestamp();
ProcessRepliesIfAny();
WalSndKeepaliveIfNecessary();
/*
* Track lag no more than once per WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS to
* avoid flooding the lag tracker when we commit frequently.
*/
Thanks for your patch, it also works well when executing our use case; the timeout no longer appears in the logs. Is it necessary now to refine this patch and make as few changes as possible in order for it to be released?

On Thu, Jan 20, 2022 at 9:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> [...]
On Fri, Jan 21, 2022 at 10:45 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> I keep your patch 0001 and I add these two calls in function WalSndUpdateProgress without modifying WalSndKeepaliveIfNecessary, it works too.
> What do your think of this patch?
>
I think this will also work. Here, the point was to just check what is the exact problem and the possible approach to solve it; the actual patch might be different from these ideas. So, let me try to summarize the problem and the possible approach to solve it so that others can also share their opinion.

Here, the problem is that we don't send keep-alive messages for a long time while processing large transactions during logical replication where we don't send any data of such transactions (say, because the table modified in the transaction is not published). We do try to send the keep_alive if necessary at the end of the transaction (via WalSndWriteData()), but by that time the subscriber-side can timeout and exit.

Now, one idea to solve this problem could be that whenever we skip sending any change we do try to update the plugin progress via OutputPluginUpdateProgress (for walsender, it will invoke WalSndUpdateProgress), and there it tries to process replies and send keep_alive if necessary, as we do when we send some data via OutputPluginWrite (for walsender, it will invoke WalSndWriteData). I don't know whether it is a good idea to invoke such a mechanism for every change we skip to send or whether we should do it after we skip sending some threshold of continuous changes. I think the latter would be preferred. Also, we might want to introduce a new parameter send_keep_alive to this API so that there is flexibility to invoke this mechanism, as we don't need to invoke it while we are actually sending data and before that, we just update the progress via this API. Thoughts?

Note: I have added Simon and Petr J. to this thread as they introduced the API OutputPluginUpdateProgress in commit 024711bb54 and know this part of the code/design well, but ideas and suggestions from everyone are welcome.
--
With Regards,
Amit Kapila.
On Thu, Jan 22, 2022 at 7:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Now, one idea to solve this problem could be that whenever we skip
> sending any change we do try to update the plugin progress via
> OutputPluginUpdateProgress(for walsender, it will invoke
> WalSndUpdateProgress), and there it tries to process replies and send
> keep_alive if necessary as we do when we send some data via
> OutputPluginWrite(for walsender, it will invoke WalSndWriteData). I
> don't know whether it is a good idea to invoke such a mechanism for
> every change we skip to send or we should do it after we skip sending
> some threshold of continuous changes. I think later would be
> preferred. Also, we might want to introduce a new parameter
> send_keep_alive to this API so that there is flexibility to invoke
> this mechanism as we don't need to invoke it while we are actually
> sending data and before that, we just update the progress via this
> API.
I tried out a patch along the lines of your advice.
I found that if I invoke ProcessRepliesIfAny and WalSndKeepaliveIfNecessary in
the function OutputPluginUpdateProgress, the newly added call to
OutputPluginUpdateProgress in pgoutput_change brings notable overhead:
--11.34%--pgoutput_change
|
|--8.94%--OutputPluginUpdateProgress
| |
| --8.70%--WalSndUpdateProgress
| |
| |--7.44%--ProcessRepliesIfAny
So I tried another way: sending the keepalive message to the standby based on
the timeout, without asking for a reply (see attachment). With this approach,
the new call to OutputPluginUpdateProgress in pgoutput_change brings only
slight overhead:
--3.63%--pgoutput_change
|
|--1.40%--get_rel_sync_entry
| |
| --1.14%--hash_search
|
--1.08%--OutputPluginUpdateProgress
|
--0.85%--WalSndUpdateProgress
Based on the above, I think the second idea (sending a keepalive only after
skipping some threshold of consecutive changes) might be better. I will do
some research on this approach.
Regards,
Wang wei
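The trade-off discussed above can be illustrated with a small standalone model; the symbols and numbers below (the batch size, the keepalive interval, the function names) are illustrative, not the actual walsender code. Consulting the clock only after a batch of skipped changes cuts the number of timestamp reads by the batch size:

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/*
 * Standalone model of the keepalive-on-skip idea; the symbols here are
 * illustrative, not the actual walsender symbols.
 */
static long clock_calls = 0;            /* how often the clock was consulted */
static long keepalives = 0;             /* keepalive messages "sent" */
static long skipped_changes_count = 0;
static time_t last_keepalive = 0;

#define KEEPALIVE_INTERVAL_SECS 30      /* stands in for wal_sender_timeout / 2 */

/* Consult the clock; "send" a keepalive if the interval has elapsed. */
static bool
maybe_send_keepalive(time_t now)
{
	clock_calls++;
	if (now - last_keepalive >= KEEPALIVE_INTERVAL_SECS)
	{
		keepalives++;                   /* real code: WalSndKeepalive() */
		last_keepalive = now;
		return true;
	}
	return false;
}

/* Second approach: only consult the clock after a batch of skipped changes. */
static bool
skip_change(time_t now, long threshold)
{
	if (++skipped_changes_count >= threshold)
	{
		skipped_changes_count = 0;
		return maybe_send_keepalive(now);
	}
	return false;
}
```

With a batch size of 10,000, skipping one million changes consults the clock only 100 times, which models the overhead difference the perf profiles above are measuring.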
On Fri, Jan 28, 2022 at 7:36 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
> shouldn't we use receiver_timeout in place of wal_sender_timeout because the
> problem comes from the consumer.
Thanks for your review.
IMO, because this is a bug fix on the publisher side, and the keepalive
message is sent based on wal_sender_timeout in the existing code, we should
keep it consistent with the existing code.
Regards,
Wang wei
Attachments
Dear Wang,

Thank you for making a patch. I applied your patch and confirmed that the code
passes the regression tests. Here is a short review:

```
+ static int skipped_changes_count = 0;
+ /*
+  * Conservatively, at least 150,000 changes can be skipped in 1s.
+  *
+  * Because we use half of wal_sender_timeout as the threshold, and the unit
+  * of wal_sender_timeout in process is ms, the final threshold is
+  * wal_sender_timeout * 75.
+  */
+ int skipped_changes_threshold = wal_sender_timeout * 75;
```

I'm not sure, but could you tell me the background of this calculation?
Is this assumption reasonable?

```
@@ -654,20 +663,62 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 {
     case REORDER_BUFFER_CHANGE_INSERT:
         if (!relentry->pubactions.pubinsert)
+        {
+            if (++skipped_changes_count >= skipped_changes_threshold)
+            {
+                OutputPluginUpdateProgress(ctx, true);
+
+                /*
+                 * After sending keepalive message, reset
+                 * skipped_changes_count.
+                 */
+                skipped_changes_count = 0;
+            }
             return;
+        }
         break;
```

Is the if-statement needed? In the walsender case OutputPluginUpdateProgress()
leads to WalSndUpdateProgress(), and that function also has a threshold for
pinging.

```
 static void
-WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
+WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool send_keep_alive)
 {
-    static TimestampTz sendTime = 0;
+    static TimestampTz trackTime = 0;
     TimestampTz now = GetCurrentTimestamp();

+    if (send_keep_alive)
+    {
+        /*
+         * If half of wal_sender_timeout has lapsed without sending a message
+         * to the standby, send a keep-alive message to the standby.
+         */
+        static TimestampTz sendTime = 0;
+        TimestampTz ping_time = TimestampTzPlusMilliseconds(sendTime,
+                                                            wal_sender_timeout / 2);
+        if (now >= ping_time)
+        {
+            WalSndKeepalive(false);
+
+            /* Try to flush pending output to the client */
+            if (pq_flush_if_writable() != 0)
+                WalSndShutdown();
+            sendTime = now;
+        }
+    }
+
```

* +1 on renaming to trackTime.
* `/2` might be a magic number. How about the following? Renaming is very welcome:

```
+#define WALSND_LOGICAL_PING_FACTOR 0.5
+        static TimestampTz sendTime = 0;
+        TimestampTz ping_time = TimestampTzPlusMilliseconds(sendTime,
+                                                            wal_sender_timeout * WALSND_LOGICAL_PING_FACTOR)
```

Could you add a commitfest entry for cfbot?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
I added the declaration `extern int wal_sender_timeout;` in the
output_plugin.h file so that compilation works.
I tested the patch for version 10 which is currently in production on our systems.
The functions below exist only in the master branch:
pgoutput_commit_prepared_txn,
pgoutput_rollback_prepared_txn,
pgoutput_stream_commit,
pgoutput_stream_prepare_txn
On Wed, Jan 26, 2022 at 11:37 AM I wrote:
> On Sat, Jan 22, 2022 at 7:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Now, one idea to solve this problem could be that whenever we skip
> > sending any change we do try to update the plugin progress via
> > OutputPluginUpdateProgress(for walsender, it will invoke
> > WalSndUpdateProgress), and there it tries to process replies and send
> > keep_alive if necessary as we do when we send some data via
> > OutputPluginWrite(for walsender, it will invoke WalSndWriteData). I
> > don't know whether it is a good idea to invoke such a mechanism for
> > every change we skip to send or we should do it after we skip sending
> > some threshold of continuous changes. I think later would be
> > preferred. Also, we might want to introduce a new parameter
> > send_keep_alive to this API so that there is flexibility to invoke
> > this mechanism as we don't need to invoke it while we are actually
> > sending data and before that, we just update the progress via this
> > API.
> ......
> Based on above, I think the second idea that sending some threshold of
> continuous changes might be better, I will do some research about this
> approach.
Based on the second idea, I wrote a new patch (see attachment).
Regards,
Wang wei
On Tue, Feb 8, 2022 at 5:18 PM Kuroda, Hayato <kuroda.hayato@fujitsu.com> wrote:
> I applied your patch and confirmed that codes passed regression test.
> I put a short reviewing:
Thanks for your test and review.

> ```
> + int skipped_changes_threshold = wal_sender_timeout * 75;
> ```
>
> I'm not sure but could you tell me the background of this calculation?
> Is this assumption reasonable?
According to our discussion, we need to send keepalive messages to the
subscriber when skipping changes.
One approach is that for **each skipped change** we try to send a keepalive
message, by calculating whether a timeout will occur based on the current time
and the last time a keepalive was sent. But this brings slight overhead.
So I wanted to try another approach: after **continuously skipping some
changes**, we try to send a keepalive message, again by calculating whether a
timeout will occur based on the current time and the last time a keepalive was
sent.
IMO, we should send a keepalive message only after continuously skipping a
certain number of changes, and I wanted to calculate that threshold
dynamically from a fixed value to avoid adding too much code. In addition,
different users have machines with different performance, and users can modify
wal_sender_timeout, so the threshold should be calculated dynamically from
wal_sender_timeout.
Based on this, I tested on machines with different configurations and took the
results from the machine with the lowest configuration.
[results]
Number of changes that can be skipped per second: 537087 (average)
To be safe, I set the value to 150000.
(wal_sender_timeout / 2 / 1000 * 150000 = wal_sender_timeout * 75)
The spec of the test server used to derive the threshold:
CPU information: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
Memory information: 816188 kB

> Is the if-statement needed? In the walsender case
> OutputPluginUpdateProgress() leads WalSndUpdateProgress(), and the
> function also has the threshold for ping-ing.
As mentioned above, we need to skip some changes continuously before checking
whether a timeout will occur. Without this if-statement, the timeout check
would run every time a change is skipped, which brings extra overhead.

> * +1 about renaming to trackTime.
> * `/2` might be magic number. How about following? Renaming is very welcome:
>
> ```
> +#define WALSND_LOGICAL_PING_FACTOR 0.5
> +        static TimestampTz sendTime = 0;
> +        TimestampTz ping_time = TimestampTzPlusMilliseconds(sendTime,
> +                                                            wal_sender_timeout * WALSND_LOGICAL_PING_FACTOR)
> ```
In the existing code, similar operations on wal_sender_timeout use the style
(wal_sender_timeout / 2), e.g. the function WalSndKeepaliveIfNecessary, so I
think this patch should stay consistent with the existing code. That said, I
agree that replacing the magic number would be an improvement; maybe we could
do that in a new thread.

> Could you add a commitfest entry for cfbot?
Thanks for the reminder, I will add it soon.

Regards,
Wang wei
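The derivation above can be checked with a couple of lines. This is a standalone sketch: SKIP_RATE_PER_SEC is the conservative 150,000 changes/second figure from the measurements quoted in this mail, not a PostgreSQL constant, and the function name is illustrative.

```c
#include <assert.h>

/* Conservative measured skip rate from the mail above: 150,000 changes/s. */
#define SKIP_RATE_PER_SEC 150000L

/*
 * Number of changes skippable within half of wal_sender_timeout (which is
 * in milliseconds):
 *   (wal_sender_timeout / 2) ms = (wal_sender_timeout / 2 / 1000) s,
 * and multiplying by the skip rate gives wal_sender_timeout * 75 changes.
 */
static long
skipped_changes_threshold(long wal_sender_timeout_ms)
{
	return wal_sender_timeout_ms / 2 / 1000 * SKIP_RATE_PER_SEC;
}
```

For the default wal_sender_timeout of 60000 ms this gives 4,500,000 changes, which matches wal_sender_timeout * 75.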
On Tue, Feb 8, 2022 at 1:59 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
> ......
> Based on the second idea, I wrote a new patch(see attachment).

Hi Wang,

Some comments:
I see you only track skipped inserts, updates, and deletes. What about DDL
operations that are skipped, and what about truncate? What about changes made
to unpublished tables? I wonder, if you created a test script that only did
DDL operations and truncates, would this timeout happen?

regards,
Ajin Cherian
Fujitsu Australia
On Fri, Feb 18, 2022 at 10:51 AM Ajin Cherian <itsajin@gmail.com> wrote:
> Some comments:
Thanks for your review.

> I see you only track skipped Inserts/Updates and Deletes. What about
> DDL operations that are skipped, what about truncate.
> What about changes made to unpublished tables? I wonder if you could
> create a test script that only did DDL operations
> and truncates, would this timeout happen?
Following your suggestion, I tested with DDL and truncate.
While testing, I ran 20,000 DDLs and 10,000 truncations in one transaction.
If I set wal_sender_timeout and wal_receiver_timeout to 30s, it times out.
If I use the default values, it does not time out.
IMHO there should not be long transactions that contain only DDL and
truncation. I'm not quite sure, do we need to handle this kind of use case?

The test details are below.

[publisher-side]
configure:
wal_sender_timeout = 30s or 60s
wal_receiver_timeout = 30s or 60s
sql:
create table tbl (a int primary key, b text);
create table tbl2 (a int primary key, b text);
create publication pub for table tbl;

[subscriber-side]
configure:
wal_sender_timeout = 30s or 60s
wal_receiver_timeout = 30s or 60s
sql:
create table tbl (a int primary key, b text);
create subscription sub connection 'dbname=postgres user=postgres' publication pub;

[Execute sql on publisher-side]
In a transaction, execute the following SQL 10,000 times in a loop:
alter table tbl2 rename column b to c;
truncate table tbl2;
alter table tbl2 rename column c to b;

Regards,
Wang wei
On Tue, Feb 22, 2022 at 9:17 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
>
> According to your suggestion, I tested with DDL and truncate.
> ......
> IMHO there should not be long transactions that only contain DDL and
> truncation. I'm not quite sure, do we need to handle this kind of use case?
>

I think it is better to handle such cases as well, and likewise changes
related to unpublished tables. BTW, it seems Kuroda-San has also given some
comments [1] which I am not sure are addressed.

I think instead of keeping the skipping threshold w.r.t. wal_sender_timeout,
we can use some conservative number like 10000 or so, which we are sure won't
impact performance and won't lead to timeouts.

*
+ /*
+  * skipped_changes_count is reset when processing changes that do not need to
+  * be skipped.
+  */
+ skipped_changes_count = 0

When the skipped_changes_count is reset, the sendTime should also be reset.
Can we reset it whenever the UpdateProgress function is called with
send_keep_alive as false?

[1] - https://www.postgresql.org/message-id/TYAPR01MB5866BD2248EF82FF432FE599F52D9%40TYAPR01MB5866.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.
Dear Wang,

Thank you for explaining some of the background of the patch.

> According to our discussion, we need to send keepalive messages to subscriber
> when skipping changes.
> ......
You mean that calls to system functions like GetCurrentTimestamp() should be
reduced, right? I'm not sure how much it affects performance, but it seems
reasonable.

> IMO, we should send keepalive message after skipping a certain number of
> changes constantly.
> ......
Your experiment seems reasonable, but the background cannot be understood from
the code comments. I prefer a static threshold because it is simpler, which is
also what Amit said in the following:
https://www.postgresql.org/message-id/CAA4eK1%2B-p_K_j%3DNiGGD6tCYXiJH0ypT4REX5PBKJ4AcUoF3gZQ%40mail.gmail.com

> In the existing code, similar operations on wal_sender_timeout use the style of
> (wal_sender_timeout / 2), e.g. function WalSndKeepaliveIfNecessary.
> ......
I confirmed the code and +1 to yours. We should treat it in another thread if
needed.

BTW, this patch cannot be applied to the current master.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On Tue, Feb 22, 2022 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Thanks for your review.

> I think it is better to handle such cases as well and changes related
> to unpublished tables as well. BTW, it seems Kuroda-San has also given
> some comments [1] which I am not sure are addressed.
Added handling of the related use cases.

> I think instead of keeping the skipping threshold w.r.t
> wal_sender_timeout, we can use some conservative number like 10000 or
> so which we are sure won't impact performance and won't lead to
> timeouts.
Yes, that is better. Set the threshold conservatively to 10000.

> When the skipped_changes_count is reset, the sendTime should also be
> reset. Can we reset it whenever the UpdateProgress function is called
> with send_keep_alive as false?
Fixed.

Attached is a new patch that addresses the following improvements suggested in
the comments so far:
1. Consider other changes that need to be skipped (truncate, DDL, and calls to
the function pg_logical_emit_message). [suggested by Ajin, Amit]
(Added a new function SendKeepaliveIfNecessary that tries to send a keepalive
message.)
2. Set the threshold conservatively to a static value of 10000. [suggested by
Amit, Kuroda-San]
3. Reset sendTime in function WalSndUpdateProgress when send_keep_alive is
false. [suggested by Amit]

Regards,
Wang wei
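A condensed, standalone model of the flow summarized above (a sketch of the idea, not the patch code): a static threshold of 10000 skipped changes triggers a keepalive, and the counter resets whenever a real change is sent.

```c
#include <assert.h>
#include <stdbool.h>

#define SKIPPED_CHANGES_THRESHOLD 10000

static long skipped_changes_count = 0;
static long keepalives_sent = 0;

/*
 * Model of the patch's flow: send_keep_alive is true when a change was
 * skipped, false when a real change is about to be sent downstream.
 */
static void
update_progress(bool send_keep_alive)
{
	if (send_keep_alive)
	{
		if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD)
		{
			keepalives_sent++;          /* real code: WalSndKeepalive() etc. */
			skipped_changes_count = 0;
		}
	}
	else
	{
		/* a real change goes out, so the counter (and sendTime) reset */
		skipped_changes_count = 0;
	}
}
```

Skipping 25,000 changes yields two keepalives; one real change then resets the counter, so another 9,999 skips produce no further keepalive.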
Attachments
On Thu, Feb 24, 2022 at 4:06 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> Dear Wang,
Thanks for your review.

> You meant that calling system calls like GetCurrentTimestamp() should be
> reduced, right? I'm not sure how it affects but it seems reasonable.
Yes. There is no need to invoke it frequently, and doing so would bring
overhead.

> Your experiment seems not bad, but the background cannot be understand
> from code comments. I prefer a static threshold because it's more simple,
> ......
Yes, you are right. Fixed; I set the threshold to 10000.

> BTW, this patch cannot be applied to current master.
Thanks for the reminder. Rebased it.

Kindly have a look at the new patch shared in [1].

[1] https://www.postgresql.org/message-id/OS3PR01MB6275FEB9F83081F1C87539B99E019%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei
Dear Wang,

> Attached a new patch that addresses following improvements I have got so far as
> comments:
> ......
Thank you for the good patch! I'll check in more detail later, but it applies
on top of my code and passes check-world.

Some minor comments:

```
+ * Try to send keepalive message
```
Maybe a missing "a"?

```
+ /*
+ * After continuously skipping SKIPPED_CHANGES_THRESHOLD changes, try to send a
+ * keepalive message.
+ */
```
This comment does not follow the preferred style:
https://www.postgresql.org/docs/devel/source-format.html

```
@@ -683,12 +683,12 @@ OutputPluginWrite(struct LogicalDecodingContext *ctx, bool last_write)
  * Update progress tracking (if supported).
  */
 void
-OutputPluginUpdateProgress(struct LogicalDecodingContext *ctx)
+OutputPluginUpdateProgress(struct LogicalDecodingContext *ctx, bool send_keep_alive)
```
This function is no longer doing just tracking.
Could you update the code comment above it?

```
 if (!is_publishable_relation(relation))
     return;
```
I'm not sure, but it seems that the function exits immediately if the relation
is a sequence, view, temporary table and so on. Is that OK? Does it never
happen?

```
+ SendKeepaliveIfNecessary(ctx, false);
```
I think a comment is needed above this which clarifies that it sends a
keepalive message.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On Mon, Feb 28, 2022 at 6:58 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> Thank you for giving a good patch! I'll check more detail later, but it can be
> applied my codes and passed check world.
> I put some minor comments:
Thanks for your comments.

> ```
> + * Try to send keepalive message
> ```
> Maybe missing "a"?
Fixed. Added the missing "a".

> This comments does not follow preferred style:
> https://www.postgresql.org/docs/devel/source-format.html
Fixed. Corrected the comment style.

> This function is no longer doing just tracking.
> Could you update the code comment above?
Fixed. Updated the comment above function OutputPluginUpdateProgress.

> ```
> if (!is_publishable_relation(relation))
>     return;
> ```
> I'm not sure but it seems that the function exits immediately if relation is a
> sequence, view, temporary table and so on. Is it OK? Does it never happen?
I did some checks to confirm this. After my confirmation, there are several
situations that can cause a timeout. For example, if I insert much data into
the table sql_features in a long transaction, the subscriber side will time
out. Although I think users should not modify these tables arbitrarily, it
could happen. To be conservative, I think this use case should be addressed as
well.
Fixed. Invoke function SendKeepaliveIfNecessary before returning.

> ```
> + SendKeepaliveIfNecessary(ctx, false);
> ```
> I think a comment is needed above which clarifies sending a keepalive message.
Fixed. Added a corresponding comment before the call to
SendKeepaliveIfNecessary.

Attached is the new patch. [suggestions by Kuroda-San]
1. Fix the typo.
2. Improve comment style.
3. Fix a missing consideration.
4. Add comments that clarify the above functions and function calls.

Regards,
Wang wei
Attachments
On Wed, Mar 2, 2022 at 1:06 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
> ...
> Attach the new patch. [suggestion by Kuroda-San]

It is difficult to read the thread and keep track of who reviewed what, which
patch is the latest, etc., when every patch has the same name. Can you please
introduce a version number for future patch attachments?

------
Kind Regards,
Peter Smith.
Fujitsu Australia
Dear Wang, > Attach the new patch. [suggestion by Kuroda-San] > 1. Fix the typo. > 2. Improve comment style. > 3. Fix missing consideration. > 4. Add comments to clarifies above functions and function calls. Thank you for updating the patch! I confirmed they were fixed. ``` case REORDER_BUFFER_CHANGE_INVALIDATION: - /* Execute the invalidation messages locally */ - ReorderBufferExecuteInvalidations( - change->data.inval.ninvalidations, - change->data.inval.invalidations); - break; + { + LogicalDecodingContext *ctx = rb->private_data; + + Assert(!ctx->fast_forward); + + /* Set output state. */ + ctx->accept_writes = true; + ctx->write_xid = txn->xid; + ctx->write_location = change->lsn; ``` Some codes were added in ReorderBufferProcessTXN() for treating DDL, I'm also happy if you give the version number :-). Best Regards, Hayato Kuroda FUJITSU LIMITED > -----Original Message----- > From: Wang, Wei/王 威 <wangw.fnst@fujitsu.com> > Sent: Wednesday, March 2, 2022 11:06 AM > To: Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> > Cc: Fabrice Chapuis <fabrice636861@gmail.com>; Simon Riggs > <simon.riggs@enterprisedb.com>; Petr Jelinek > <petr.jelinek@enterprisedb.com>; Tang, Haiying/唐 海英 > <tanghy.fnst@fujitsu.com>; Amit Kapila <amit.kapila16@gmail.com>; > PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>; Ajin Cherian > <itsajin@gmail.com> > Subject: RE: Logical replication timeout problem > > On Mon, Feb 28, 2022 at 6:58 PM Kuroda, Hayato/黒田 隼人 > <kuroda.hayato@fujitsu.com> wrote: > > Dear Wang, > > > > > Attached a new patch that addresses following improvements I have got > > > so far as > > > comments: > > > 1. Consider other changes that need to be skipped(truncate, DDL and > > > function calls pg_logical_emit_message). [suggestion by Ajin, Amit] > > > (Add a new function SendKeepaliveIfNecessary for trying to send > > > keepalive > > > message.) > > > 2. Set the threshold conservatively to a static value of > > > 10000.[suggestion by Amit, Kuroda-San] 3. 
Reset sendTime in function > > > WalSndUpdateProgress when send_keep_alive is false. [suggestion by > > > Amit] > > > > Thank you for giving a good patch! I'll check more detail later, but it can be > > applied my codes and passed check world. > > I put some minor comments: > Thanks for your comments. > > > ``` > > + * Try to send keepalive message > > ``` > > > > Maybe missing "a"? > Fixed. Add missing "a". > > > ``` > > + /* > > + * After continuously skipping SKIPPED_CHANGES_THRESHOLD > changes, try > > to send a > > + * keepalive message. > > + */ > > ``` > > > > This comments does not follow preferred style: > > https://www.postgresql.org/docs/devel/source-format.html > Fixed. Correct wrong comment style. > > > ``` > > @@ -683,12 +683,12 @@ OutputPluginWrite(struct LogicalDecodingContext > *ctx, > > bool last_write) > > * Update progress tracking (if supported). > > */ > > void > > -OutputPluginUpdateProgress(struct LogicalDecodingContext *ctx) > > +OutputPluginUpdateProgress(struct LogicalDecodingContext *ctx, bool > > +send_keep_alive) > > ``` > > > > This function is no longer doing just tracking. > > Could you update the code comment above? > Fixed. Update the comment above function OutputPluginUpdateProgress. > > > ``` > > if (!is_publishable_relation(relation)) > > return; > > ``` > > > > I'm not sure but it seems that the function exits immediately if relation is a > > sequence, view, temporary table and so on. Is it OK? Does it never happen? > I did some checks to confirm this. After my confirmation, there are several > situations that can cause a timeout. For example, if I insert many date into > table sql_features in a long transaction, subscriber-side will time out. > Although I think users should not modify these tables arbitrarily, it could > happen. To be conservative, I think this use case should be addressed as well. > Fixed. Invoke function SendKeepaliveIfNecessary before return. 
> > > ``` > > + SendKeepaliveIfNecessary(ctx, false); > > ``` > > > > I think a comment is needed above which clarifies sending a keepalive > message. > Fixed. Before invoking function SendKeepaliveIfNecessary, added the > corresponding > comment. > > Attach the new patch. [suggestion by Kuroda-San] > 1. Fix the typo. > 2. Improve comment style. > 3. Fix missing consideration. > 4. Add comments to clarify the above functions and function calls. > > Regards, > Wang wei
Dear Wang, > Some code was added in ReorderBufferProcessTXN() for handling DDL, My mailer went wrong, so I'll put my comments again. Sorry. Some code was added in ReorderBufferProcessTXN() for handling DDL, but I doubt updating accept_writes is needed. IMU, the parameter is read by OutputPluginPrepareWrite() in order to align messages. They should have a header - like 'w' - before their body. But here only a keepalive message is sent, with no meaningful changes, so I think it might not be needed. I commented out the line and tested like you did [1], and no timeouts or errors were found. Do you have any reason for that? https://www.postgresql.org/message-id/OS3PR01MB6275A95FD44DC6C46058EA389E3B9%40OS3PR01MB6275.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED
On Fri, Mar 4, 2022 at 4:26 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote: > Thanks for your test and comments. > Some code was added in ReorderBufferProcessTXN() for handling DDL, > but I doubt updating accept_writes is needed. > IMU, the parameter is read by OutputPluginPrepareWrite() in order to align > messages. > They should have a header - like 'w' - before their body. But here only a > keepalive message is sent, > with no meaningful changes, so I think it might not be needed. > I commented out the line and tested like you did [1], and no timeouts or errors > were found. > Do you have any reason for that? > > https://www.postgresql.org/message- > id/OS3PR01MB6275A95FD44DC6C46058EA389E3B9%40OS3PR01MB6275.jpnprd0 > 1.prod.outlook.com Yes, you are right. We should not set accept_writes to true here. And I looked into the function WalSndUpdateProgress. I found that WalSndUpdateProgress tries to record the time some messages (via function LagTrackerWrite) are sent to the subscriber, such as in function pgoutput_commit_txn. Then, when the publisher receives the reply message from the subscriber (function ProcessStandbyReplyMessage), the publisher invokes LagTrackerRead to calculate the delay time (refer to view pg_stat_replication). Referring to the purpose of LagTrackerWrite, I think there is no need to log the time when sending keepalive messages here. So when the parameter send_keep_alive of function WalSndUpdateProgress is true, skip recording the time. > I'm also happy if you give the version number :-). Introduced version information, starting from version 1. Attach the new patch. 1. Fix wrong variable setting and skip unnecessary time records. [suggestion by Kuroda-San and me.] 2. Introduce version information. [suggestion by Peter, Kuroda-San] Regards, Wang wei
Attachments
On Tue, Mar 8, 2022 at 12:25 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > Attach the new patch. > 1. Fix wrong variable setting and skip unnecessary time records.[suggestion by Kuroda-San and me.] > 2. Introduce version information.[suggestion by Peter, Kuroda-San] > > Regards, > Wang wei Some comments. 1. The comment on top of SendKeepaliveIfNecessary Try to send a keepalive message if too many changes was skipped. change to Try to send a keepalive message if too many changes were skipped. 2. In pgoutput_change: + /* Reset the counter for skipped changes. */ + SendKeepaliveIfNecessary(ctx, false); + This reset is called too early; this function might go on to skip changes because of the row filter, so this reset fits better once we know for sure that a change is sent out. You will also need to send a keepalive when the change is skipped due to the row filter. regards, Ajin Cherian Fujitsu Australia
Hi, On Tue, Mar 8, 2022 at 10:25 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > On Fri, Mar 4, 2022 at 4:26 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote: > > > Thanks for your test and comments. > > > Some codes were added in ReorderBufferProcessTXN() for treating DDL, > > but I doubted updating accept_writes is needed. > > IMU, the parameter is read by OutputPluginPrepareWrite() in order align > > messages. > > They should have a header - like 'w' - before their body. But here only a > > keepalive message is sent, > > no meaningful changes, so I think it might be not needed. > > I commented out the line and tested like you did [1], and no timeout and errors > > were found. > > Do you have any reasons for that? > > > > https://www.postgresql.org/message- > > id/OS3PR01MB6275A95FD44DC6C46058EA389E3B9%40OS3PR01MB6275.jpnprd0 > > 1.prod.outlook.com > Yes, you are right. We should not set accept_writes to true here. > And I looked into the function WalSndUpdateProgress. I found function > WalSndUpdateProgress try to record the time of some message(by function > LagTrackerWrite) sent to subscriber, such as in function pgoutput_commit_txn. > Then, when publisher receives the reply message from the subscriber(function > ProcessStandbyReplyMessage), publisher invokes LagTrackerRead to calculate the > delay time(refer to view pg_stat_replication). > Referring to the purpose of LagTrackerWrite, I think it is no need to log time > when sending keepalive messages here. > So when the parameter send_keep_alive of function WalSndUpdateProgress is true, > skip the recording time. > > > I'm also happy if you give the version number :-). > Introduce version information, starting from version 1. > > Attach the new patch. > 1. Fix wrong variable setting and skip unnecessary time records.[suggestion by Kuroda-San and me.] > 2. 
Introduce version information.[suggestion by Peter, Kuroda-San] I've looked at the patch and have a question: +void +SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped) +{ + static int skipped_changes_count = 0; + + /* + * skipped_changes_count is reset when processing changes that do not + * need to be skipped. + */ + if (!skipped) + { + skipped_changes_count = 0; + return; + } + + /* + * After continuously skipping SKIPPED_CHANGES_THRESHOLD changes, try to send a + * keepalive message. + */ + #define SKIPPED_CHANGES_THRESHOLD 10000 + + if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD) + { + /* Try to send a keepalive message. */ + OutputPluginUpdateProgress(ctx, true); + + /* After trying to send a keepalive message, reset the flag. */ + skipped_changes_count = 0; + } +} Since we send a keepalive after continuously skipping 10000 changes, the originally reported issue can still occur if skipping 10000 changes took more than the timeout and the walsender didn't send any change during that time, is that right? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Dear Wang, Thank you for updating the patch! Good self-review. > And I looked into the function WalSndUpdateProgress. I found function > WalSndUpdateProgress tries to record the time of some messages (by function > LagTrackerWrite) sent to the subscriber, such as in function pgoutput_commit_txn. Yeah, I think you are right. > Then, when the publisher receives the reply message from the subscriber (function > ProcessStandbyReplyMessage), the publisher invokes LagTrackerRead to calculate > the > delay time (refer to view pg_stat_replication). > Referring to the purpose of LagTrackerWrite, I think there is no need to log the time > when sending keepalive messages here. > So when the parameter send_keep_alive of function WalSndUpdateProgress is > true, > skip recording the time. I also read them. LagTracker records the elapsed time between the publisher sending a commit and receiving the reply from the subscriber, right? It seems good. Do we need to add a test for this? I think it could be added to 100_bugs.pl. I tried to write a PoC, but have not finished implementing it. I'll send it when it is done. Best Regards, Hayato Kuroda FUJITSU LIMITED
On Tue, Mar 8, 2022 at 11:54 PM Ajin Cherian <itsajin@gmail.com> wrote: > Some comments. Thanks for your comments. > 1. The comment on top of SendKeepaliveIfNecessary > > Try to send a keepalive message if too many changes was skipped. > > change to > > Try to send a keepalive message if too many changes were skipped. Fixed. Changed 'was' to 'were'. > 2. In pgoutput_change: > > + /* Reset the counter for skipped changes. */ > + SendKeepaliveIfNecessary(ctx, false); > + > > This reset is called too early, this function might go on to skip > changes because of the row filter, so this > reset fits better once we know for sure that a change is sent out. You > will also need to send a keepalive > when the change is skipped due to the row filter. Fixed. Added a flag 'is_send' to record whether the change is sent, then reset the counter or try to send a keepalive message based on the flag 'is_send'. Attach the new patch. 1. Fix typo in comment on top of SendKeepaliveIfNecessary. [suggestion by Ajin.] 2. Add handling of cases filtered out by the row filter. [suggestion by Ajin.] Regards, Wang wei
Attachments
On Tue, Mar 8, 2022 at 3:52 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I've looked at the patch and have a question: Thanks for your review and comments. > +void > +SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped) { > + static int skipped_changes_count = 0; > + > + /* > + * skipped_changes_count is reset when processing changes that do not > + * need to be skipped. > + */ > + if (!skipped) > + { > + skipped_changes_count = 0; > + return; > + } > + > + /* > + * After continuously skipping SKIPPED_CHANGES_THRESHOLD > changes, try to send a > + * keepalive message. > + */ > + #define SKIPPED_CHANGES_THRESHOLD 10000 > + > + if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD) > + { > + /* Try to send a keepalive message. */ > + OutputPluginUpdateProgress(ctx, true); > + > + /* After trying to send a keepalive message, reset the flag. */ > + skipped_changes_count = 0; > + } > +} > > Since we send a keepalive after continuously skipping 10000 changes, the > originally reported issue can still occur if skipping 10000 changes took more than > the timeout and the walsender didn't send any change while that, is that right? Yes, theoretically so. But after testing, I think this value should be conservative enough not to reproduce this bug. After the previous discussion[1], it is currently considered that it is better to directly set a conservative threshold than to calculate the threshold based on wal_sender_timeout. [1] - https://www.postgresql.org/message-id/OS3PR01MB6275FEB9F83081F1C87539B99E019%40OS3PR01MB6275.jpnprd01.prod.outlook.com Regards, Wang wei
On Tue, Mar 8, 2022 at 4:48 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote: > Thank you for updating the patch! Good self-review. Thanks for your comments. > > And I looked into the function WalSndUpdateProgress. I found function > > WalSndUpdateProgress try to record the time of some message(by > > function > > LagTrackerWrite) sent to subscriber, such as in function pgoutput_commit_txn. > > Yeah, I think you are right. > > > Then, when publisher receives the reply message from the > > subscriber(function ProcessStandbyReplyMessage), publisher invokes > > LagTrackerRead to calculate the delay time(refer to view > > pg_stat_replication). > > Referring to the purpose of LagTrackerWrite, I think it is no need to > > log time when sending keepalive messages here. > > So when the parameter send_keep_alive of function WalSndUpdateProgress > > is true, skip the recording time. > > I also read them. LagTracker records the elapsed time between sending commit > from publisher and receiving reply from subscriber, right? It seems good. Yes. > Do we need adding a test for them? I think it can be added to 100_bugs.pl. > Actually I tried to send PoC, but it does not finish to implement that. > I'll send if it is done. I'm not sure it is worth it: a test reproducing this bug might take some time and risk slowing down the buildfarm, so I am not sure others would want such a test. Regards, Wang wei
Dear Wang, Thank you for updating! > > Do we need adding a test for them? I think it can be added to 100_bugs.pl. > > Actually I tried to send PoC, but it does not finish to implement that. > > I'll send if it is done. > I'm not sure if it is worth it. > Because the reproduced test of this bug might take some time and might risk > making the build farm slow, so I am not sure if others would like the > reproduced test of this bug. I take your point that it may be difficult to stabilize and minimize such a test, so I withdraw the above. I put some comments for v2, mainly cosmetic ones. 1. pgoutput_change ``` + bool is_send = true; ``` My first impression is that is_send should be initialized to false, and it will change to true when OutputPluginWrite() is called. 2. pgoutput_change ``` + { + is_send = false; + break; + } ``` There are too many indents here, but I think they should be removed. See the above comment. 3. WalSndUpdateProgress ``` + /* + * If half of wal_sender_timeout has lapsed without send message standby, + * send a keep-alive message to the standby. + */ ``` The comment seems inconsistent with the others. Here it is "keep-alive", but the other parts use "keepalive". 4. ReorderBufferProcessTXN ``` + change->data.inval.ninvalidations, + change->data.inval.invalidations); ``` Maybe these lines break the 80-column rule. Best Regards, Hayato Kuroda FUJITSU LIMITED
On Wed, Mar 9, 2022 at 11:26 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > On Tue, Mar 8, 2022 at 3:52 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've looked at the patch and have a question: > Thanks for your review and comments. > > > +void > > +SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped) { > > + static int skipped_changes_count = 0; > > + > > + /* > > + * skipped_changes_count is reset when processing changes that do not > > + * need to be skipped. > > + */ > > + if (!skipped) > > + { > > + skipped_changes_count = 0; > > + return; > > + } > > + > > + /* > > + * After continuously skipping SKIPPED_CHANGES_THRESHOLD > > changes, try to send a > > + * keepalive message. > > + */ > > + #define SKIPPED_CHANGES_THRESHOLD 10000 > > + > > + if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD) > > + { > > + /* Try to send a keepalive message. */ > > + OutputPluginUpdateProgress(ctx, true); > > + > > + /* After trying to send a keepalive message, reset the flag. */ > > + skipped_changes_count = 0; > > + } > > +} > > > > Since we send a keepalive after continuously skipping 10000 changes, the > > originally reported issue can still occur if skipping 10000 changes took more than > > the timeout and the walsender didn't send any change during that time, is that right? > Yes, theoretically so. > But after testing, I think this value should be conservative enough not to reproduce > this bug. But it really depends on the workload, the server condition, and the timeout value, right? The logical decoding might involve much disk I/O to spill/load intermediate data, and the system might be under a high-load condition. Why don't we check both the count and the time? That is, I think we can send a keep-alive either if we skipped 10000 changes or if we didn't send anything for wal_sender_timeout / 2.
Also, the patch changes the current behavior of wal senders; with the patch, we send keep-alive messages even when wal_sender_timeout = 0. But I'm not sure it's a good idea. The subscriber's wal_receiver_timeout might be lower than wal_sender_timeout. Instead, I think it's better to periodically check replies and send a reply to the keep-alive message sent from the subscriber if necessary, for example, every 10000 skipped changes. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Wed, Mar 9, 2022 at 2:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Thanks for your comments. > On Wed, Mar 9, 2022 at 10:26 AM I wrote: > > On Tue, Mar 8, 2022 at 3:52 PM Masahiko Sawada <sawada.mshk@gmail.com> > wrote: > > > I've looked at the patch and have a question: > > Thanks for your review and comments. > > > > > +void > > > +SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped) { > > > + static int skipped_changes_count = 0; > > > + > > > + /* > > > + * skipped_changes_count is reset when processing changes that do > not > > > + * need to be skipped. > > > + */ > > > + if (!skipped) > > > + { > > > + skipped_changes_count = 0; > > > + return; > > > + } > > > + > > > + /* > > > + * After continuously skipping SKIPPED_CHANGES_THRESHOLD > > > changes, try to send a > > > + * keepalive message. > > > + */ > > > + #define SKIPPED_CHANGES_THRESHOLD 10000 > > > + > > > + if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD) > > > + { > > > + /* Try to send a keepalive message. */ > > > + OutputPluginUpdateProgress(ctx, true); > > > + > > > + /* After trying to send a keepalive message, reset the flag. */ > > > + skipped_changes_count = 0; > > > + } > > > +} > > > > > > Since we send a keepalive after continuously skipping 10000 changes, the > > > originally reported issue can still occur if skipping 10000 changes took more > than > > > the timeout and the walsender didn't send any change while that, is that > right? > > Yes, theoretically so. > > But after testing, I think this value should be conservative enough not to > reproduce > > this bug. > > But it really depends on the workload, the server condition, and the > timeout value, right? The logical decoding might involve disk I/O much > to spill/load intermediate data and the system might be under the > high-load condition. Why don't we check both the count and the time? 
> That is, I think we can send a keep-alive either if we skipped 10000 > changes or if we didn't send anything for wal_sender_timeout / 2. Yes, you are right. Do you mean that, when skipping every change, we check whether it has been more than (wal_sender_timeout / 2) without sending anything? IIUC, I tried to send keep-alive messages based on time before [1], but after testing, I found that it brings slight overhead. So I am not sure whether this kind of overhead should be introduced in a function (pgoutput_change) that is invoked frequently. > Also, the patch changes the current behavior of wal senders; with the > patch, we send keep-alive messages even when wal_sender_timeout = 0. > But I'm not sure it's a good idea. The subscriber's > wal_receiver_timeout might be lower than wal_sender_timeout. Instead, > I think it's better to periodically check replies and send a reply to > the keep-alive message sent from the subscriber if necessary, for > example, every 10000 skipped changes. Sorry, I could not follow what you said. Do you mean the following? 1. When we haven't sent anything for (wal_sender_timeout / 2), or we have skipped 10000 changes continuously, we invoke the function WalSndKeepalive from the function WalSndUpdateProgress and send a keepalive message to the subscriber, requesting an immediate reply. 2. If, after sending a keepalive message, another 10000 changes are skipped continuously, we need to handle the reply from the subscriber side when processing the 10000th change. The handling approach is to reply to the confirmation message from the subscriber. [1] - https://www.postgresql.org/message-id/OS3PR01MB6275DFFDAC7A59FA148931529E209%40OS3PR01MB6275.jpnprd01.prod.outlook.com Please let me know if I have misunderstood. Regards, Wang wei
On Wed, Mar 16, 2022 at 11:57 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > On Wed, Mar 9, 2022 at 2:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Thanks for your comments. > > > On Wed, Mar 9, 2022 at 10:26 AM I wrote: > > > On Tue, Mar 8, 2022 at 3:52 PM Masahiko Sawada <sawada.mshk@gmail.com> > > wrote: > > > > I've looked at the patch and have a question: > > > Thanks for your review and comments. > > > > > > > +void > > > > +SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped) { > > > > + static int skipped_changes_count = 0; > > > > + > > > > + /* > > > > + * skipped_changes_count is reset when processing changes that do > > not > > > > + * need to be skipped. > > > > + */ > > > > + if (!skipped) > > > > + { > > > > + skipped_changes_count = 0; > > > > + return; > > > > + } > > > > + > > > > + /* > > > > + * After continuously skipping SKIPPED_CHANGES_THRESHOLD > > > > changes, try to send a > > > > + * keepalive message. > > > > + */ > > > > + #define SKIPPED_CHANGES_THRESHOLD 10000 > > > > + > > > > + if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD) > > > > + { > > > > + /* Try to send a keepalive message. */ > > > > + OutputPluginUpdateProgress(ctx, true); > > > > + > > > > + /* After trying to send a keepalive message, reset the flag. */ > > > > + skipped_changes_count = 0; > > > > + } > > > > +} > > > > > > > > Since we send a keepalive after continuously skipping 10000 changes, the > > > > originally reported issue can still occur if skipping 10000 changes took more > > than > > > > the timeout and the walsender didn't send any change while that, is that > > right? > > > Yes, theoretically so. > > > But after testing, I think this value should be conservative enough not to > > reproduce > > > this bug. > > > > But it really depends on the workload, the server condition, and the > > timeout value, right? 
The logical decoding might involve disk I/O much > > to spill/load intermediate data and the system might be under the > > high-load condition. Why don't we check both the count and the time? > > That is, I think we can send a keep-alive either if we skipped 10000 > > changes or if we didn't sent anything for wal_sender_timeout / 2. > Yes, you are right. > Do you mean that when skipping every change, check if it has been more than > (wal_sender_timeout / 2) without sending anything? > IIUC, I tried to send keep-alive messages based on time before[1], but after > testing, I found that it will brings slight overhead. So I am not sure, in a > function(pgoutput_change) that is invoked frequently, should this kind of > overhead be introduced? > > > Also, the patch changes the current behavior of wal senders; with the > > patch, we send keep-alive messages even when wal_sender_timeout = 0. > > But I'm not sure it's a good idea. The subscriber's > > wal_receiver_timeout might be lower than wal_sender_timeout. Instead, > > I think it's better to periodically check replies and send a reply to > > the keep-alive message sent from the subscriber if necessary, for > > example, every 10000 skipped changes. > Sorry, I could not follow what you said. I am not sure, do you mean the > following? > 1. When we didn't sent anything for (wal_sender_timeout / 2) or we skipped > 10000 changes continuously, we will invoke the function WalSndKeepalive in the > function WalSndUpdateProgress, and send a keepalive message to the subscriber > with requesting an immediate reply. > 2. If after sending a keepalive message, and then 10000 changes are skipped > continuously again. In this case, we need to handle the reply from the > subscriber-side when processing the 10000th change. The handling approach is to > reply to the confirmation message from the subscriber. After more thought, can we check only wal_sender_timeout without skip-count? 
That is, in WalSndUpdateProgress(), if we have received any reply from the subscriber in the last (wal_sender_timeout / 2), we don't need to do anything in terms of keep-alive. If not, we do ProcessRepliesIfAny() (and probably WalSndCheckTimeOut()?) then WalSndKeepalivesIfNecessary(). That way, we can send keep-alive messages every (wal_sender_timeout / 2). And since we don't call them for every change, we would not need to worry about the overhead much. Actually, WalSndWriteData() does similar things; even in the case where we don't skip consecutive changes (i.e., sending consecutive changes to the subscriber), we do ProcessRepliesIfAny() at least every (wal_sender_timeout / 2). I think this would work in most common cases where the user sets both wal_sender_timeout and wal_receiver_timeout to the same value. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Wed, Mar 16, 2022 at 7:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Mar 16, 2022 at 11:57 AM wangw.fnst@fujitsu.com > <wangw.fnst@fujitsu.com> wrote: > > > But it really depends on the workload, the server condition, and the > > > timeout value, right? The logical decoding might involve disk I/O much > > > to spill/load intermediate data and the system might be under the > > > high-load condition. Why don't we check both the count and the time? > > > That is, I think we can send a keep-alive either if we skipped 10000 > > > changes or if we didn't sent anything for wal_sender_timeout / 2. > > Yes, you are right. > > Do you mean that when skipping every change, check if it has been more than > > (wal_sender_timeout / 2) without sending anything? > > IIUC, I tried to send keep-alive messages based on time before[1], but after > > testing, I found that it will brings slight overhead. So I am not sure, in a > > function(pgoutput_change) that is invoked frequently, should this kind of > > overhead be introduced? > > > > > Also, the patch changes the current behavior of wal senders; with the > > > patch, we send keep-alive messages even when wal_sender_timeout = 0. > > > But I'm not sure it's a good idea. The subscriber's > > > wal_receiver_timeout might be lower than wal_sender_timeout. Instead, > > > I think it's better to periodically check replies and send a reply to > > > the keep-alive message sent from the subscriber if necessary, for > > > example, every 10000 skipped changes. > > Sorry, I could not follow what you said. I am not sure, do you mean the > > following? > > 1. When we didn't sent anything for (wal_sender_timeout / 2) or we skipped > > 10000 changes continuously, we will invoke the function WalSndKeepalive in the > > function WalSndUpdateProgress, and send a keepalive message to the subscriber > > with requesting an immediate reply. > > 2. 
If after sending a keepalive message, and then 10000 changes are skipped > > continuously again. In this case, we need to handle the reply from the > > subscriber-side when processing the 10000th change. The handling approach is to > > reply to the confirmation message from the subscriber. > > After more thought, can we check only wal_sender_timeout without > skip-count? That is, in WalSndUpdateProgress(), if we have received > any reply from the subscriber in last (wal_sender_timeout / 2), we > don't need to do anything in terms of keep-alive. If not, we do > ProcessRepliesIfAny() (and probably WalSndCheckTimeOut()?) then > WalSndKeepalivesIfNecessary(). That way, we can send keep-alive > messages every (wal_sender_timeout / 2). And since we don't call them > for every change, we would not need to worry about the overhead much. > But won't that lead to a call to GetCurrentTimestamp() for each change we skip? IIUC from previous replies, that led to a slight slowdown in Wang-San's earlier tests. > Actually, WalSndWriteData() does similar things; > That also seems to call GetCurrentTimestamp() every time. I think it might be okay when we are sending the change, but I am not sure the overhead of that is negligible when we are skipping changes. -- With Regards, Amit Kapila.
On Thu, Mar 17, 2022 at 12:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 16, 2022 at 7:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > After more thought, can we check only wal_sender_timeout without > > skip-count? That is, in WalSndUpdateProgress(), if we have received > > any reply from the subscriber in last (wal_sender_timeout / 2), we > > don't need to do anything in terms of keep-alive. If not, we do > > ProcessRepliesIfAny() (and probably WalSndCheckTimeOut()?) then > > WalSndKeepalivesIfNecessary(). That way, we can send keep-alive > > messages every (wal_sender_timeout / 2). And since we don't call them > > for every change, we would not need to worry about the overhead much. > > > > But won't that lead to a call to GetCurrentTimestamp() for each change > we skip? IIUC from previous replies that lead to a slight slowdown in > previous tests of Wang-San. > If the above is true then I think we can use a lower skip_count say 10 along with a timeout mechanism to send keepalive message. This will help us to alleviate the overhead Wang-San has shown. BTW, I think there could be one other advantage of using ProcessRepliesIfAny() (as you are suggesting) is that it can help to release sync waiters if there are any. I feel that would be the case for the skip_empty_transactions patch [1] which uses WalSndUpdateProgress to send keepalive messages after skipping empty transactions. [1] - https://www.postgresql.org/message-id/CAFPTHDYvRSyT5ppYSPsH4Ozs0_W62-nffu0%3DmY1%2BsVipF%3DUN-g%40mail.gmail.com -- With Regards, Amit Kapila.
On Thu, Mar 17, 2022 at 7:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Mar 17, 2022 at 12:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Mar 16, 2022 at 7:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > After more thought, can we check only wal_sender_timeout without > > > skip-count? That is, in WalSndUpdateProgress(), if we have received > > > any reply from the subscriber in last (wal_sender_timeout / 2), we > > > don't need to do anything in terms of keep-alive. If not, we do > > > ProcessRepliesIfAny() (and probably WalSndCheckTimeOut()?) then > > > WalSndKeepalivesIfNecessary(). That way, we can send keep-alive > > > messages every (wal_sender_timeout / 2). And since we don't call them > > > for every change, we would not need to worry about the overhead much. > > > > > > > But won't that lead to a call to GetCurrentTimestamp() for each change > > we skip? IIUC from previous replies that lead to a slight slowdown in > > previous tests of Wang-San. > > > If the above is true then I think we can use a lower skip_count say 10 > along with a timeout mechanism to send keepalive message. This will > help us to alleviate the overhead Wang-San has shown. Using both sounds reasonable to me. I'd like to see how much the overhead is alleviated by using skip_count 10 (or 100). > BTW, I think there could be one other advantage of using > ProcessRepliesIfAny() (as you are suggesting) is that it can help to > release sync waiters if there are any. I feel that would be the case > for the skip_empty_transactions patch [1] which uses > WalSndUpdateProgress to send keepalive messages after skipping empty > transactions. +1 Regards, [1] https://www.postgresql.org/message-id/OS3PR01MB6275DFFDAC7A59FA148931529E209%40OS3PR01MB6275.jpnprd01.prod.outlook.com -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Thu, Mar 17, 2022 at 7:52 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Thanks for your comments. > On Thu, Mar 17, 2022 at 7:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Mar 17, 2022 at 12:27 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > > > On Wed, Mar 16, 2022 at 7:38 PM Masahiko Sawada > <sawada.mshk@gmail.com> wrote: > > > > > > > > After more thought, can we check only wal_sender_timeout without > > > > skip-count? That is, in WalSndUpdateProgress(), if we have > > > > received any reply from the subscriber in last (wal_sender_timeout > > > > / 2), we don't need to do anything in terms of keep-alive. If not, > > > > we do > > > > ProcessRepliesIfAny() (and probably WalSndCheckTimeOut()?) then > > > > WalSndKeepalivesIfNecessary(). That way, we can send keep-alive > > > > messages every (wal_sender_timeout / 2). And since we don't call > > > > them for every change, we would not need to worry about the overhead > much. > > > > > > > > > > But won't that lead to a call to GetCurrentTimestamp() for each > > > change we skip? IIUC from previous replies that lead to a slight > > > slowdown in previous tests of Wang-San. > > > > > If the above is true then I think we can use a lower skip_count say 10 > > along with a timeout mechanism to send keepalive message. This will > > help us to alleviate the overhead Wang-San has shown. > > Using both sounds reasonable to me. I'd like to see how much the overhead is > alleviated by using skip_count 10 (or 100). > > > BTW, I think there could be one other advantage of using > > ProcessRepliesIfAny() (as you are suggesting) is that it can help to > > release sync waiters if there are any. I feel that would be the case > > for the skip_empty_transactions patch [1] which uses > > WalSndUpdateProgress to send keepalive messages after skipping empty > > transactions. > > +1 I modified the patch according to your and Amit-San's suggestions. 
In addition, after testing, I found that a threshold of 10 brings slight overhead, so I tried changing it to 100; after testing again, the results look good to me. 10 : 1.22%--UpdateProgress 100 : 0.16%--UpdateProgress Please refer to the attachment. Attach the new patch. 1. Refactor the way keepalive messages are sent. [suggestion by Sawada-San, Amit-San.] 2. Change the initialization of the is_send flag to make it look more reasonable. [suggestion by Kuroda-San.] 3. Improve the new function name. (From SendKeepaliveIfNecessary to UpdateProgress.) Regards, Wang wei
Attachments
On Thu, Mar 9, 2022 at 11:52 AM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote: > Thank you for updating! Thanks for your comments. > 1. pgoutput_change > ``` > + bool is_send = true; > ``` > > My first impression is that is_send should be initialized to false, and it will change > to true when OutputPluginWrite() is called. > > > 2. pgoutput_change > ``` > + { + is_send = false; > + break; > + } > ``` > > Here are too many indents, but I think they should be removed. > See above comment. Fixed. Initialized is_send to false. > 3. WalSndUpdateProgress > ``` > + /* > + * If half of wal_sender_timeout has lapsed without send message > standby, > + * send a keep-alive message to the standby. > + */ > ``` > > The comment seems inconsistency with others. > Here is "keep-alive", but other parts are "keepalive". Since this part of the code was refactored, this inconsistent comment was removed. > 4. ReorderBufferProcessTXN > ``` > + change- > >data.inval.ninvalidations, > + > + change->data.inval.invalidations); > ``` > > Maybe these lines break 80-columns rule. Thanks for the reminder. I will run pgindent later. Kindly have a look at the new patch shared in [1]. [1] - https://www.postgresql.org/message-id/OS3PR01MB6275C67F14954E05CE5D04399E139%40OS3PR01MB6275.jpnprd01.prod.outlook.com Regards, Wang wei
On Fri, Mar 18, 2022 at 10:43 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > On Thu, Mar 17, 2022 at 7:52 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Attach the new patch. > * case REORDER_BUFFER_CHANGE_INVALIDATION: - /* Execute the invalidation messages locally */ - ReorderBufferExecuteInvalidations( - change->data.inval.ninvalidations, - change->data.inval.invalidations); - break; + { + LogicalDecodingContext *ctx = rb->private_data; + + /* Try to send a keepalive message. */ + UpdateProgress(ctx, true); Calling UpdateProgress() here appears ad hoc to me, especially because it calls OutputPluginUpdateProgress, which appears to be called only from the plugin API. Am I missing something? Also, why is the same handling missed in other similar messages like REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID where we don't call any plug-in API? I am not sure what is a good way to achieve this, but one idea that occurred to me was: shall we invent a new callback ReorderBufferSkipChangeCB similar to ReorderBufferApplyChangeCB, and then pgoutput can register its API where we can have logic similar to what you have in UpdateProgress()? If we do so, then all the current callers of UpdateProgress in pgoutput can also call that API. What do you think? * Why don't you have a quick exit like the below code in WalSndWriteData? /* Try taking fast path unless we get too close to walsender timeout. */ if (now < TimestampTzPlusMilliseconds(last_reply_timestamp, wal_sender_timeout / 2) && !pq_is_send_pending()) { return; } * Can we rename variable 'is_send' to 'change_sent'? -- With Regards, Amit Kapila.
On Fri, Mar 18, 2022 at 4:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Mar 18, 2022 at 10:43 AM wangw.fnst@fujitsu.com > <wangw.fnst@fujitsu.com> wrote: > > > > On Thu, Mar 17, 2022 at 7:52 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > Attach the new patch. > > > > * > case REORDER_BUFFER_CHANGE_INVALIDATION: > - /* Execute the invalidation messages locally */ > - ReorderBufferExecuteInvalidations( > - change->data.inval.ninvalidations, > - change->data.inval.invalidations); > - break; > + { > + LogicalDecodingContext *ctx = rb->private_data; > + > + /* Try to send a keepalive message. */ > + UpdateProgress(ctx, true); > > Calling UpdateProgress() here appears adhoc to me especially because > it calls OutputPluginUpdateProgress which appears to be called only > from plugin API. Am, I missing something? Also why the same handling > is missed in other similar messages like > REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID where we don't call any > plug-in API? > > I am not sure what is a good way to achieve this but one idea that > occurred to me was shall we invent a new callback > ReorderBufferSkipChangeCB similar to ReorderBufferApplyChangeCB and > then pgoutput can register its API where we can have the logic similar > to what you have in UpdateProgress()? If we do so, then all the > cuurent callers of UpdateProgress in pgoutput can also call that API. > What do you think? > Another idea could be that we leave the DDL case for now as anyway there is very less chance of timeout for skipping DDLs and we may later need to even backpatch this bug-fix which would be another reason to not make such invasive changes. We can handle the DDL case if required separately. -- With Regards, Amit Kapila.
On Mon, Mar 21, 2022 at 1:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Thanks for your comments. > On Fri, Mar 18, 2022 at 4:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Mar 18, 2022 at 10:43 AM wangw.fnst@fujitsu.com > > <wangw.fnst@fujitsu.com> wrote: > > > > > > On Thu, Mar 17, 2022 at 7:52 PM Masahiko Sawada > <sawada.mshk@gmail.com> wrote: > > > > > > > > > > Attach the new patch. > > > > > > > * > > case REORDER_BUFFER_CHANGE_INVALIDATION: > > - /* Execute the invalidation messages locally */ > > - ReorderBufferExecuteInvalidations( > > - change->data.inval.ninvalidations, > > - change->data.inval.invalidations); > > - break; > > + { > > + LogicalDecodingContext *ctx = rb->private_data; > > + > > + /* Try to send a keepalive message. */ > > + UpdateProgress(ctx, true); > > > > Calling UpdateProgress() here appears adhoc to me especially because > > it calls OutputPluginUpdateProgress which appears to be called only > > from plugin API. Am, I missing something? Also why the same handling > > is missed in other similar messages like > > REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID where we don't call > any > > plug-in API? Yes, you are right. And I invoke in case REORDER_BUFFER_CHANGE_INVALIDATION because I think every DDL will modify the catalog then get into this case. So I only invoke function UpdateProgress here to handle DDL. > > I am not sure what is a good way to achieve this but one idea that > > occurred to me was shall we invent a new callback > > ReorderBufferSkipChangeCB similar to ReorderBufferApplyChangeCB and > > then pgoutput can register its API where we can have the logic similar > > to what you have in UpdateProgress()? If we do so, then all the > > cuurent callers of UpdateProgress in pgoutput can also call that API. > > What do you think? 
> > > Another idea could be that we leave the DDL case for now as anyway > there is very less chance of timeout for skipping DDLs and we may > later need to even backpatch this bug-fix which would be another > reason to not make such invasive changes. We can handle the DDL case > if required separately. Yes, I think a new callback function would be nice. And as you said, maybe we could first fix the use case that revealed the problem, then make further modifications on the master branch. Modified the patch. Currently only DML-related code remains. > > * Why don't you have a quick exit like below code in WalSndWriteData? > > /* Try taking fast path unless we get too close to walsender timeout. */ if (now > > < TimestampTzPlusMilliseconds(last_reply_timestamp, > > wal_sender_timeout / 2) && > > !pq_is_send_pending()) > > { > > return; > > } Fixed. I missed this, so I added it in the new patch. > > * Can we rename variable 'is_send' to 'change_sent'? Improved the name of this variable. (From 'is_send' to 'change_sent') Attach the new patch. [suggestion by Amit-San.] 1. Remove DDL-related code. Handle the DDL case separately later if needed. 2. Fix an omission. (In function WalSndUpdateProgress) 3. Improve variable names. (From 'is_send' to 'change_sent') 4. Fix some comments. (Above and inside the function WalSndUpdateProgress.) Regards, Wang wei
Attachments
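The quick exit discussed in the last two messages — modeled on the fast path in WalSndWriteData — reduces to a single predicate. A hedged sketch under assumed names, using plain int64 milliseconds in place of TimestampTz and a hypothetical function instead of the real walsender code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Take the fast path (do no further progress/keepalive work) while we are
 * not yet close to half of wal_sender_timeout since the last reply and
 * nothing is pending in the send buffer.  Mirrors the structure of the
 * quoted WalSndWriteData check; names here are illustrative.
 */
static bool
take_fast_path(int64_t now_ms, int64_t last_reply_ms,
			   int64_t wal_sender_timeout_ms, bool send_pending)
{
	return now_ms < last_reply_ms + wal_sender_timeout_ms / 2 &&
		!send_pending;
}
```

The point of the predicate is that the common case (recent reply, empty send buffer) returns immediately, so the keepalive machinery only runs when a timeout is actually approaching.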
On Tue, Mar 22, 2022 at 7:25 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > Attach the new patch. > It seems by mistake you have removed the changes from pgoutput_message and pgoutput_truncate functions. I have added those back. Additionally, I made a few other changes: (a) moved the function UpdateProgress to pgoutput.c as it is not used outside it, (b) change the new parameter in plugin API from 'send_keep_alive' to 'last_write' to make it look similar to WalSndPrepareWrite and WalSndWriteData, (c) made a number of changes in WalSndUpdateProgress API, it is better to move keep-alive code after lag track code because we do process replies at that time and there it will compute the lag; (d) changed/added comments in the code. Do let me know what you think of the attached? -- With Regards, Amit Kapila.
Attachments
Dear Amit, > It seems by mistake you have removed the changes from pgoutput_message > and pgoutput_truncate functions. I have added those back. > Additionally, I made a few other changes: (a) moved the function > UpdateProgress to pgoutput.c as it is not used outside it, (b) change > the new parameter in plugin API from 'send_keep_alive' to 'last_write' > to make it look similar to WalSndPrepareWrite and WalSndWriteData, (c) > made a number of changes in WalSndUpdateProgress API, it is better to > move keep-alive code after lag track code because we do process > replies at that time and there it will compute the lag; (d) > changed/added comments in the code. LGTM, but the patch cannot be applied to current HEAD. Best Regards, Hayato Kuroda FUJITSU LIMITED
On Thu, Mar 24, 2022 at 6:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Thanks for your kind update. > It seems by mistake you have removed the changes from pgoutput_message > and pgoutput_truncate functions. I have added those back. > Additionally, I made a few other changes: (a) moved the function > UpdateProgress to pgoutput.c as it is not used outside it, (b) change > the new parameter in plugin API from 'send_keep_alive' to 'last_write' > to make it look similar to WalSndPrepareWrite and WalSndWriteData, (c) > made a number of changes in WalSndUpdateProgress API, it is better to > move keep-alive code after lag track code because we do process > replies at that time and there it will compute the lag; (d) > changed/added comments in the code. > > Do let me know what you think of the attached? It looks good to me; I just rebased it because of the change in the header (75b1521). I tested it and the results are good. Attach the new patch. Regards, Wang wei
Attachments
On Fri, Mar 25, 2022 at 2:23 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > On Thur, Mar 24, 2022 at 6:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Thanks for your kindly update. > > > It seems by mistake you have removed the changes from pgoutput_message > > and pgoutput_truncate functions. I have added those back. > > Additionally, I made a few other changes: (a) moved the function > > UpdateProgress to pgoutput.c as it is not used outside it, (b) change > > the new parameter in plugin API from 'send_keep_alive' to 'last_write' > > to make it look similar to WalSndPrepareWrite and WalSndWriteData, (c) > > made a number of changes in WalSndUpdateProgress API, it is better to > > move keep-alive code after lag track code because we do process > > replies at that time and there it will compute the lag; (d) > > changed/added comments in the code. > > > > Do let me know what you think of the attached? > It looks good to me. Just rebase it because the change in header(75b1521). > I tested it and the result looks good to me. Since commit 75b1521 added decoding of sequence to logical replication, the patch needs to have pgoutput_sequence() call update_progress(). Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Fri, Mar 25, 2022 at 11:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Mar 25, 2022 at 2:23 PM wangw.fnst@fujitsu.com > <wangw.fnst@fujitsu.com> wrote: > > Since commit 75b1521 added decoding of sequence to logical > replication, the patch needs to have pgoutput_sequence() call > update_progress(). > Yeah, I also think this needs to be addressed. But apart from this, I want to know your and others' opinions on the following two points: a. Both this and the patch discussed in the nearby thread [1] add an additional parameter to WalSndUpdateProgress/OutputPluginUpdateProgress, and it seems to me that both are required. The additional parameter 'last_write' added by this patch indicates: "If the last write is skipped then try (if we are close to wal_sender_timeout) to send a keepalive message to the receiver to avoid timeouts.". This means it can be used after any 'write' message. OTOH, the parameter 'skipped_xact' added by the other patch [1] indicates that if we have skipped sending anything for a transaction, then we send a keepalive for synchronous replication to avoid any delays in such a transaction. Does this sound reasonable or can you think of a better way to deal with it? b. Do we want to backpatch the patch in this thread? I am reluctant to backpatch because it changes the exposed API, which can have an impact, and second, there exists a workaround (the user can increase wal_sender_timeout/wal_receiver_timeout). [1] - https://www.postgresql.org/message-id/OS0PR01MB5716BB24409D4B69206615B1941A9%40OS0PR01MB5716.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
On Fri, Mar 25, 2022 at 2:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Fri, Mar 25, 2022 at 2:23 PM wangw.fnst@fujitsu.com > <wangw.fnst@fujitsu.com> wrote: > > > > On Thur, Mar 24, 2022 at 6:32 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > > Thanks for your kindly update. > > > > > It seems by mistake you have removed the changes from > pgoutput_message > > > and pgoutput_truncate functions. I have added those back. > > > Additionally, I made a few other changes: (a) moved the function > > > UpdateProgress to pgoutput.c as it is not used outside it, (b) change > > > the new parameter in plugin API from 'send_keep_alive' to 'last_write' > > > to make it look similar to WalSndPrepareWrite and WalSndWriteData, (c) > > > made a number of changes in WalSndUpdateProgress API, it is better to > > > move keep-alive code after lag track code because we do process > > > replies at that time and there it will compute the lag; (d) > > > changed/added comments in the code. > > > > > > Do let me know what you think of the attached? > > It looks good to me. Just rebase it because the change in header(75b1521). > > I tested it and the result looks good to me. > > Since commit 75b1521 added decoding of sequence to logical > replication, the patch needs to have pgoutput_sequence() call > update_progress(). Thanks for your comments. Yes, you are right. Add missing handling of pgoutput_sequence. Attach the new patch. Regards, Wang wei
Attachments
Dear Wang-san, Thank you for updating! ...but it also cannot be applied to current HEAD because of commit 923def9a533. Your patch seems to conflict with the added argument of logicalrep_write_insert(). That commit allows specifying columns to publish by skipping some columns in logicalrep_write_tuple(), which is called from logicalrep_write_insert() and logicalrep_write_update(). Do we have to consider any special case for that? I thought a timeout might occur if users have a huge table and publish few columns, but that is a corner case. Best Regards, Hayato Kuroda FUJITSU LIMITED
On Mon, Mar 28, 2022 at 9:56 AM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote: > Dear Wang-san, Thanks for your comments. > Thank you for updating! > ...but it also cannot be applied to current HEAD > because of the commit 923def9a533. > > Your patch seems to conflict the adding an argument of > logicalrep_write_insert(). > It allows specifying columns to publish by skipping some columns in > logicalrep_write_tuple() > which is called from logicalrep_write_insert() and logicalrep_write_update(). Thanks for your kind reminder. I rebased the patch. > Do we have to consider something special case for that? > I thought timeout may occur if users have huge table and publish few columns, > but it is corner case. I think we do not need to deal with this use case. The maximum number of table columns allowed by PG is 1600 (the macro MaxHeapAttributeNumber), and after looping through all columns in the function logicalrep_write_tuple, the function OutputPluginWrite will be invoked immediately to actually send the data to the subscriber. This refreshes the time at which the subscriber last received a message. So I think this loop will not cause timeout issues. Regards, Wang wei
Attachments
On Mon, Mar 28, 2022 at 11:41 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > On Mon, Mar 28, 2022 at 9:56 AM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote: > > > Do we have to consider something special case for that? > > I thought timeout may occur if users have huge table and publish few columns, > > but it is corner case. > I think maybe we do not need to deal with this use case. > The maximum number of table columns allowed by PG is 1600 > (macro MaxHeapAttributeNumber), and after loop through all columns in the > function logicalrep_write_tuple, the function OutputPluginWrite will be invoked > immediately to actually send the data to the subscriber. This refreshes the > last time the subscriber received a message. > So I think this loop will not cause timeout issues. > Right, I also don't think it can be a source of timeout. -- With Regards, Amit Kapila.
Dear Amit, Wang, > > I think maybe we do not need to deal with this use case. > > The maximum number of table columns allowed by PG is 1600 > > (macro MaxHeapAttributeNumber), and after loop through all columns in the > > function logicalrep_write_tuple, the function OutputPluginWrite will be invoked > > immediately to actually send the data to the subscriber. This refreshes the > > last time the subscriber received a message. > > So I think this loop will not cause timeout issues. > > > > Right, I also don't think it can be a source of timeout. OK. I have no comments for this version. Best Regards, Hayato Kuroda FUJITSU LIMITED
On Mon, Mar 28, 2022 at 2:11 AM I wrote: > Rebase the patch. After reviewing another patch [1], I think this patch should also add a loop in function WalSndUpdateProgress like what is done in function WalSndWriteData. So I updated the patch to be consistent with the existing code and the patch mentioned above. Attach the new patch. [1] - https://www.postgresql.org/message-id/OS0PR01MB5716946347F607F4CFB02FCE941D9%40OS0PR01MB5716.jpnprd01.prod.outlook.com Regards, Wang wei
Attachments
On Fri, Mar 25, 2022 at 5:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Mar 25, 2022 at 11:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Fri, Mar 25, 2022 at 2:23 PM wangw.fnst@fujitsu.com > > <wangw.fnst@fujitsu.com> wrote: > > > > Since commit 75b1521 added decoding of sequence to logical > > replication, the patch needs to have pgoutput_sequence() call > > update_progress(). > > > > Yeah, I also think this needs to be addressed. But apart from this, I > want to know your and other's opinion on the following two points: > a. Both this and the patch discussed in the nearby thread [1] add an > additional parameter to > WalSndUpdateProgress/OutputPluginUpdateProgress and it seems to me > that both are required. The additional parameter 'last_write' added by > this patch indicates: "If the last write is skipped then try (if we > are close to wal_sender_timeout) to send a keepalive message to the > receiver to avoid timeouts.". This means it can be used after any > 'write' message. OTOH, the parameter 'skipped_xact' added by another > patch [1] indicates if we have skipped sending anything for a > transaction then sendkeepalive for synchronous replication to avoid > any delays in such a transaction. Does this sound reasonable or can > you think of a better way to deal with it? These current approaches look good to me. > b. Do we want to backpatch the patch in this thread? I am reluctant to > backpatch because it changes the exposed API which can have an impact > and second there exists a workaround (user can increase > wal_sender_timeout/wal_receiver_timeout). Yeah, we should avoid API changes between minor versions. I feel it's better to fix it also for back-branches but probably we need another fix for them. The issue reported on this thread seems quite confusable; it looks like a network problem but is not true. 
Also, the user who faced this issue has to increase wal_sender_timeout due to the decoded data size, which also means to delay detecting network problems. It seems an unrelated trade-off. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Tue, Mar 29, 2022 at 9:45 AM I wrote: > Attach the new patch. Rebased the patch because of commit d5a9d86d in current HEAD. Regards, Wang wei
Attachments
On Wed, Mar 30, 2022 at 1:24 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > On Tues, Mar 29, 2022 at 9:45 AM I wrote: > > Attach the new patch. > > Rebase the patch because the commit d5a9d86d in current HEAD. > Thanks, this looks good to me apart from a minor indentation change which I'll take care of before committing. I am planning to push this day after tomorrow on Friday unless there are any other major comments. -- With Regards, Amit Kapila.
On Wed, Mar 30, 2022 3:54 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > Rebase the patch because the commit d5a9d86d in current HEAD. > Thanks for your patch. I tried it and confirmed that the timeout problem no longer occurs after applying it, while I could still reproduce the problem on HEAD. Regards, Shi yu
On Wed, Mar 30, 2022 at 6:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 30, 2022 at 1:24 PM wangw.fnst@fujitsu.com > <wangw.fnst@fujitsu.com> wrote: > > > > On Tues, Mar 29, 2022 at 9:45 AM I wrote: > > > Attach the new patch. > > > > Rebase the patch because the commit d5a9d86d in current HEAD. > > > > Thanks, this looks good to me apart from a minor indentation change > which I'll take care of before committing. I am planning to push this > day after tomorrow on Friday unless there are any other major > comments. The patch basically looks good to me. But the only concern to me is that once we get the patch committed, we will have to call update_progress() at all paths in callbacks that process changes. Which seems poor maintainability. On the other hand, possible another solution would be to add a new callback that is called e.g., every 1000 changes so that walsender does its job such as timeout handling while processing the decoded data in reorderbuffer.c. The callback is set only if the walsender does logical decoding, otherwise NULL. With this idea, other plugins will also be able to benefit without changes. But I’m not really sure it’s a good design, and adding a new callback introduces complexity. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
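The alternative design floated in this message — reorderbuffer invoking an optional progress callback every N processed changes, set only when the walsender does logical decoding — can be sketched as follows. All type and function names here are hypothetical illustrations, not the actual PostgreSQL API:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative interval standing in for "e.g., every 1000 changes". */
#define SKETCH_PROGRESS_INTERVAL 1000

typedef void (*SketchProgressCB) (void *ctx);

typedef struct SketchReorderBuffer
{
	SketchProgressCB progress_cb;	/* NULL unless set by the walsender */
	void	   *ctx;
	long		nprocessed;
} SketchReorderBuffer;

static void
sketch_process_change(SketchReorderBuffer *rb)
{
	rb->nprocessed++;

	/* ... the change itself would be applied or skipped here ... */

	if (rb->progress_cb != NULL &&
		rb->nprocessed % SKETCH_PROGRESS_INTERVAL == 0)
		rb->progress_cb(rb->ctx);	/* keepalives, timeout handling, ... */
}

/* Tiny harness: count how often the callback fires for n changes. */
static long sketch_calls;

static void
sketch_count_cb(void *ctx)
{
	(void) ctx;
	sketch_calls++;
}

static long
sketch_simulate(int nchanges)
{
	SketchReorderBuffer rb = {sketch_count_cb, NULL, 0};
	int			i;

	sketch_calls = 0;
	for (i = 0; i < nchanges; i++)
		sketch_process_change(&rb);
	return sketch_calls;
}
```

Because the hook lives in the change-processing loop rather than in each output-plugin callback, any plugin would get the timeout handling for free; the trade-off, as noted, is the added complexity of a new callback in the decoding API.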
On Thu, Mar 31, 2022 at 5:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Wed, Mar 30, 2022 at 6:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Mar 30, 2022 at 1:24 PM wangw.fnst@fujitsu.com > > <wangw.fnst@fujitsu.com> wrote: > > > > > > On Tues, Mar 29, 2022 at 9:45 AM I wrote: > > > > Attach the new patch. > > > > > > Rebase the patch because the commit d5a9d86d in current HEAD. > > > > > > > Thanks, this looks good to me apart from a minor indentation change > > which I'll take care of before committing. I am planning to push this > > day after tomorrow on Friday unless there are any other major > > comments. > > The patch basically looks good to me. But the only concern to me is > that once we get the patch committed, we will have to call > update_progress() at all paths in callbacks that process changes. > Which seems poor maintainability. > > On the other hand, possible another solution would be to add a new > callback that is called e.g., every 1000 changes so that walsender > does its job such as timeout handling while processing the decoded > data in reorderbuffer.c. The callback is set only if the walsender > does logical decoding, otherwise NULL. With this idea, other plugins > will also be able to benefit without changes. But I’m not really sure > it’s a good design, and adding a new callback introduces complexity. > Yeah, same here. I have also mentioned another way to expose an API from reorderbuffer [1] by introducing a skip API but just not sure if that or this API is generic enough to make it adding worth. Also, note that the current patch makes the progress recording of large transactions somewhat better when most of the changes are skipped. We can further extend it to make it true for other cases as well but that probably can be done separately if required as that is not required for this bug-fix. 
I intend to commit this patch today but I think it is better to wait for a few more days to see if anybody has any opinion on this matter. I'll push this on Tuesday unless we decide to do something different here. [1] - https://www.postgresql.org/message-id/CAA4eK1%2BfQjndoBOFUn9Wy0hhm3MLyUWEpcT9O7iuCELktfdBiQ%40mail.gmail.com -- With Regards, Amit Kapila.
On Fri, Apr 1, 2022 at 7:33 AM Euler Taveira <euler@eulerto.com> wrote: > > On Thu, Mar 31, 2022, at 9:24 AM, Masahiko Sawada wrote: > > On the other hand, possible another solution would be to add a new > callback that is called e.g., every 1000 changes so that walsender > does its job such as timeout handling while processing the decoded > data in reorderbuffer.c. The callback is set only if the walsender > does logical decoding, otherwise NULL. With this idea, other plugins > will also be able to benefit without changes. But I’m not really sure > it’s a good design, and adding a new callback introduces complexity. > > No new callback is required. > > In the current code, each output plugin callback is responsible to call > OutputPluginUpdateProgress. It is up to the output plugin author to add calls > to this function. The lack of a call in a callback might cause issues like what > was described in the initial message. > This is exactly our initial analysis and we have tried a patch on these lines and it has a noticeable overhead. See [1]. Calling this for each change or each skipped change can bring noticeable overhead that is why we decided to call it after a certain threshold (100) of skipped changes. Now, surely as mentioned in my previous reply we can make it generic such that instead of calling this (update_progress function as in the patch) for skipped cases, we call it always. Will that make it better? > The functions CreateInitDecodingContext and CreateDecodingContext receives the > update_progress function as a parameter. These functions are called in 2 > places: (a) streaming replication protocol (CREATE_REPLICATION_SLOT) and (b) > SQL logical decoding functions (pg_logical_*_changes). Case (a) uses > WalSndUpdateProgress as a progress function. Case (b) does not have one because > it is not required -- local decoding/communication. 
There is no custom update > progress routine for each output plugin which leads me to the question: > couldn't we encapsulate the update progress call into the callback functions? > Sorry, I don't get your point. What exactly do you mean by this? AFAIS, currently we call this output plugin API in pgoutput functions only, do you intend to get it invoked from a different place? [1] - https://www.postgresql.org/message-id/OS3PR01MB6275DFFDAC7A59FA148931529E209%40OS3PR01MB6275.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
On Fri, Apr 1, 2022 at 8:28 AM Euler Taveira <euler@eulerto.com> wrote: > > On Thu, Mar 31, 2022, at 11:27 PM, Amit Kapila wrote: > > This is exactly our initial analysis and we have tried a patch on > these lines and it has a noticeable overhead. See [1]. Calling this > for each change or each skipped change can bring noticeable overhead > that is why we decided to call it after a certain threshold (100) of > skipped changes. Now, surely as mentioned in my previous reply we can > make it generic such that instead of calling this (update_progress > function as in the patch) for skipped cases, we call it always. Will > that make it better? > > That's what I have in mind but using a different approach. > > > The functions CreateInitDecodingContext and CreateDecodingContext receives the > > update_progress function as a parameter. These functions are called in 2 > > places: (a) streaming replication protocol (CREATE_REPLICATION_SLOT) and (b) > > SQL logical decoding functions (pg_logical_*_changes). Case (a) uses > > WalSndUpdateProgress as a progress function. Case (b) does not have one because > > it is not required -- local decoding/communication. There is no custom update > > progress routine for each output plugin which leads me to the question: > > couldn't we encapsulate the update progress call into the callback functions? > > > > Sorry, I don't get your point. What exactly do you mean by this? > AFAIS, currently we call this output plugin API in pgoutput functions > only, do you intend to get it invoked from a different place? > > It seems I didn't make myself clear. The callbacks I'm referring to the > *_cb_wrapper functions. After every ctx->callbacks.foo_cb() call into a > *_cb_wrapper() function, we have something like: > > if (ctx->progress & PGOUTPUT_PROGRESS_FOO) > NewUpdateProgress(ctx, false); > > The NewUpdateProgress function would contain a logic similar to the > update_progress() from the proposed patch. 
(A different function name here just > to avoid confusion.) > > The output plugin is responsible to set ctx->progress with the callback > variables (for example, PGOUTPUT_PROGRESS_CHANGE for change_cb()) that we would > like to run NewUpdateProgress. > This sounds like a conflicting approach to what we currently do. Currently, OutputPluginUpdateProgress() is called from the xact related pgoutput functions like pgoutput_commit_txn(), pgoutput_prepare_txn(), pgoutput_commit_prepared_txn(), etc. So, if we follow what you are saying then for some of the APIs like pgoutput_change/_message/_truncate, we need to set the parameter to invoke NewUpdateProgress() which will internally call OutputPluginUpdateProgress(), and for the remaining APIs, we will call in the corresponding pgoutput_* function. I feel if we want to make it more generic than the current patch, it is better to directly call what you are referring to here as NewUpdateProgress() in all remaining APIs like pgoutput_change/_truncate, etc. -- With Regards, Amit Kapila.
On Fri, Apr 1, 2022 at 12:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Apr 1, 2022 at 8:28 AM Euler Taveira <euler@eulerto.com> wrote: > > > > On Thu, Mar 31, 2022, at 11:27 PM, Amit Kapila wrote: > > > > This is exactly our initial analysis and we have tried a patch on > > these lines and it has a noticeable overhead. See [1]. Calling this > > for each change or each skipped change can bring noticeable overhead > > that is why we decided to call it after a certain threshold (100) of > > skipped changes. Now, surely as mentioned in my previous reply we can > > make it generic such that instead of calling this (update_progress > > function as in the patch) for skipped cases, we call it always. Will > > that make it better? > > > > That's what I have in mind but using a different approach. > > > > > The functions CreateInitDecodingContext and CreateDecodingContext > receives the > > > update_progress function as a parameter. These functions are called in 2 > > > places: (a) streaming replication protocol (CREATE_REPLICATION_SLOT) and > (b) > > > SQL logical decoding functions (pg_logical_*_changes). Case (a) uses > > > WalSndUpdateProgress as a progress function. Case (b) does not have one > because > > > it is not required -- local decoding/communication. There is no custom > update > > > progress routine for each output plugin which leads me to the question: > > > couldn't we encapsulate the update progress call into the callback functions? > > > > > > > Sorry, I don't get your point. What exactly do you mean by this? > > AFAIS, currently we call this output plugin API in pgoutput functions > > only, do you intend to get it invoked from a different place? > > > > It seems I didn't make myself clear. The callbacks I'm referring to the > > *_cb_wrapper functions. 
After every ctx->callbacks.foo_cb() call into a > > *_cb_wrapper() function, we have something like: > > > > if (ctx->progress & PGOUTPUT_PROGRESS_FOO) > > NewUpdateProgress(ctx, false); > > > > The NewUpdateProgress function would contain a logic similar to the > > update_progress() from the proposed patch. (A different function name here > just > > to avoid confusion.) > > > > The output plugin is responsible to set ctx->progress with the callback > > variables (for example, PGOUTPUT_PROGRESS_CHANGE for change_cb()) > that we would > > like to run NewUpdateProgress. > > > > This sounds like a conflicting approach to what we currently do. > Currently, OutputPluginUpdateProgress() is called from the xact > related pgoutput functions like pgoutput_commit_txn(), > pgoutput_prepare_txn(), pgoutput_commit_prepared_txn(), etc. So, if we > follow what you are saying then for some of the APIs like > pgoutput_change/_message/_truncate, we need to set the parameter to > invoke NewUpdateProgress() which will internally call > OutputPluginUpdateProgress(), and for the remaining APIs, we will call > in the corresponding pgoutput_* function. I feel if we want to make it > more generic than the current patch, it is better to directly call > what you are referring to here as NewUpdateProgress() in all remaining > APIs like pgoutput_change/_truncate, etc. Thanks for your comments. According to your suggestion, improve the patch to make it more generic. Attach the new patch. Regards, Wang wei
Attachments
On Wed, Apr 6, 2022 at 11:09 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > On Fri, Apr 1, 2022 at 12:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Apr 1, 2022 at 8:28 AM Euler Taveira <euler@eulerto.com> wrote: > > > > > > It seems I didn't make myself clear. The callbacks I'm referring to the > > > *_cb_wrapper functions. After every ctx->callbacks.foo_cb() call into a > > > *_cb_wrapper() function, we have something like: > > > > > > if (ctx->progress & PGOUTPUT_PROGRESS_FOO) > > > NewUpdateProgress(ctx, false); > > > > > > The NewUpdateProgress function would contain a logic similar to the > > > update_progress() from the proposed patch. (A different function name here > > just > > > to avoid confusion.) > > > > > > The output plugin is responsible to set ctx->progress with the callback > > > variables (for example, PGOUTPUT_PROGRESS_CHANGE for change_cb()) > > that we would > > > like to run NewUpdateProgress. > > > > > > > This sounds like a conflicting approach to what we currently do. > > Currently, OutputPluginUpdateProgress() is called from the xact > > related pgoutput functions like pgoutput_commit_txn(), > > pgoutput_prepare_txn(), pgoutput_commit_prepared_txn(), etc. So, if we > > follow what you are saying then for some of the APIs like > > pgoutput_change/_message/_truncate, we need to set the parameter to > > invoke NewUpdateProgress() which will internally call > > OutputPluginUpdateProgress(), and for the remaining APIs, we will call > > in the corresponding pgoutput_* function. I feel if we want to make it > > more generic than the current patch, it is better to directly call > > what you are referring to here as NewUpdateProgress() in all remaining > > APIs like pgoutput_change/_truncate, etc. > Thanks for your comments. > > According to your suggestion, improve the patch to make it more generic. > Attach the new patch. 
> typedef void (*LogicalOutputPluginWriterUpdateProgress) (struct LogicalDecodingContext *lr,
>                                                          XLogRecPtr Ptr,
>                                                          TransactionId xid,
> -                                                        bool skipped_xact
> +                                                        bool skipped_xact,
> +                                                        bool last_write

In this approach, I don't think we need an additional parameter last_write. Let's do the work related to keepalive without a parameter; do you see any problem with that?

Also, let's try to evaluate how it impacts the lag functionality for large transactions.

--
With Regards,
Amit Kapila.
On Wed, Apr 6, 2022 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Apr 6, 2022 at 11:09 AM wangw.fnst@fujitsu.com > <wangw.fnst@fujitsu.com> wrote: > > > > According to your suggestion, improve the patch to make it more generic. > > Attach the new patch. > > > > typedef void (*LogicalOutputPluginWriterUpdateProgress) (struct > LogicalDecodingContext *lr, > XLogRecPtr Ptr, > TransactionId xid, > - bool skipped_xact > + bool skipped_xact, > + bool last_write > > In this approach, I don't think we need an additional parameter > last_write. Let's do the work related to keepalive without a > parameter, do you see any problem with that? > I think this patch doesn't take into account that we call OutputPluginUpdateProgress() from APIs like pgoutput_commit_txn(). I think we should always call the new function update_progress from those existing call sites and arrange the function such that when called from xact end APIs like pgoutput_commit_txn(), it always call OutputPluginUpdateProgress and make changes_count as 0. -- With Regards, Amit Kapila.
On Wed, Apr 6, 2022 at 1:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Apr 6, 2022 at 4:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Thanks for your comments.

> typedef void (*LogicalOutputPluginWriterUpdateProgress) (struct
> LogicalDecodingContext *lr,
> XLogRecPtr Ptr,
> TransactionId xid,
> - bool skipped_xact
> + bool skipped_xact,
> + bool last_write
>
> In this approach, I don't think we need an additional parameter last_write. Let's
> do the work related to keepalive without a parameter, do you see any problem
> with that?

I agree with you. Modified this point.

> I think this patch doesn't take into account that we call
> OutputPluginUpdateProgress() from APIs like pgoutput_commit_txn(). I
> think we should always call the new function update_progress from
> those existing call sites and arrange the function such that when
> called from xact end APIs like pgoutput_commit_txn(), it always call
> OutputPluginUpdateProgress and make changes_count as 0.

Improved it. Added two new inputs to the function update_progress (skipped_xact and end_xact), and changed the invocations from OutputPluginUpdateProgress to update_progress.

> Also, let's try to evaluate how it impacts lag functionality for large transactions?

I think this patch will not affect the lag functionality. It will update the lag field of the view pg_stat_replication more frequently. IIUC, when invoking the function WalSndUpdateProgress, it stores the LSN of the change and the invoking time in lag_tracker. Then, when invoking the function ProcessStandbyReplyMessage, it calculates the lag field according to the message from the subscriber and the information in lag_tracker. This patch does not modify this logic; it only increases the frequency of invoking it. Please let me know if I have misunderstood.

Attach the new patch.
1. Remove the new function input parameter in this patch (parameter last_write of WalSndUpdateProgress). [suggestion by Amit-San]
2. Also invoke the function update_progress in other xact end APIs like pgoutput_commit_txn. [suggestion by Amit-San]

Regards,
Wang wei
Attachments
On Wed, Apr 6, 2022 at 6:30 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > On Wed, Apr 6, 2022 at 1:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Apr 6, 2022 at 4:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Thanks for your comments. > > > typedef void (*LogicalOutputPluginWriterUpdateProgress) (struct > > LogicalDecodingContext *lr, > > XLogRecPtr Ptr, > > TransactionId xid, > > - bool skipped_xact > > + bool skipped_xact, > > + bool last_write > > > > In this approach, I don't think we need an additional parameter last_write. Let's > > do the work related to keepalive without a parameter, do you see any problem > > with that? > I agree with you. Modify this point. > > > I think this patch doesn't take into account that we call > > OutputPluginUpdateProgress() from APIs like pgoutput_commit_txn(). I > > think we should always call the new function update_progress from > > those existing call sites and arrange the function such that when > > called from xact end APIs like pgoutput_commit_txn(), it always call > > OutputPluginUpdateProgress and make changes_count as 0. > Improve it. > Add two new input to function update_progress.(skipped_xact and end_xact). > Modify the function invoke from OutputPluginUpdateProgress to update_progress. > > > Also, let's try to evaluate how it impacts lag functionality for large transactions? > I think this patch will not affect lag functionality. It will updates the lag > field of view pg_stat_replication more frequently. > IIUC, when invoking function WalSndUpdateProgress, it will store the lsn of > change and invoking time in lag_tracker. Then when invoking function > ProcessStandbyReplyMessage, it will calculate the lag field according to the > message from subscriber and the information in lag_tracker. This patch does > not modify this logic, but only increases the frequency of invoking. > Please let me know if I understand wrong. > No, your understanding seems correct to me. 
But what I want to check is whether calling the progress function more often has any impact on the lag-related fields in pg_stat_replication. I think you need to check the impact of large transaction replay.

One comment:

+static void
+update_progress(LogicalDecodingContext *ctx, bool skipped_xact, bool end_xact)
+{
+	static int	changes_count = 0;
+
+	if (end_xact)
+	{
+		/* Update progress tracking at xact end. */
+		OutputPluginUpdateProgress(ctx, skipped_xact);
+		changes_count = 0;
+	}
+	/*
+	 * After continuously processing CHANGES_THRESHOLD changes, update progress
+	 * which will also try to send a keepalive message if required.

I think you can simply return after making changes_count = 0. There should be an empty line before starting the next comment.

--
With Regards,
Amit Kapila.
On Wed, Apr 7, 2022 at 1:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Thanks for your comments. > One comment: > +static void > +update_progress(LogicalDecodingContext *ctx, bool skipped_xact, bool > end_xact) > +{ > + static int changes_count = 0; > + > + if (end_xact) > + { > + /* Update progress tracking at xact end. */ > + OutputPluginUpdateProgress(ctx, skipped_xact); > + changes_count = 0; > + } > + /* > + * After continuously processing CHANGES_THRESHOLD changes, update > progress > + * which will also try to send a keepalive message if required. > > I think you can simply return after making changes_count = 0. There > should be an empty line before starting the next comment. Improve as suggested. BTW, there is a conflict in current HEAD when applying v12 because of the commit 2c7ea57. Also rebase it. Attach the new patch. 1. Make some improvements to the new function update_progress. [suggestion by Amit-San] 2. Rebase the patch because the commit 2c7ea57 in current HEAD. Regards, Wang wei
Attachments
On Wed, Apr 7, 2022 at 1:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Apr 6, 2022 at 6:30 PM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > On Wed, Apr 6, 2022 at 1:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Wed, Apr 6, 2022 at 4:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > Also, let's try to evaluate how it impacts lag functionality for large transactions?
> > I think this patch will not affect lag functionality. It will updates the lag
> > field of view pg_stat_replication more frequently.
> > IIUC, when invoking function WalSndUpdateProgress, it will store the lsn of
> > change and invoking time in lag_tracker. Then when invoking function
> > ProcessStandbyReplyMessage, it will calculate the lag field according to the
> > message from subscriber and the information in lag_tracker. This patch does
> > not modify this logic, but only increases the frequency of invoking.
> > Please let me know if I understand wrong.
>
> No, your understanding seems correct to me. But what I want to check
> is if calling the progress function more often has any impact on
> lag-related fields in pg_stat_replication? I think you need to check
> the impact of large transaction replay.

Thanks for the explanation.

After doing some checks, I found that the v13 patch makes the lag calculation inaccurate.

In short, the v13 patch lets us try to track lag more frequently and try to send a keepalive message to the subscriber. But to prevent flooding the lag tracker, we cannot track lag more than once within WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS (see function WalSndUpdateProgress). This means we may lose information that needs to be tracked.

For example, suppose there is a large transaction with LSNs from lsn1 to lsn3. In HEAD, when we calculate the lag time for lsn3, the lag time of lsn3 is (now - lsn3.time). But with the v13 patch, lag_tracker may have no entry for lsn3 but only one for lsn2, so the lag time of lsn3 becomes (now - lsn2.time) (see function LagTrackerRead). Therefore, if we lose information that needs to be tracked, the lag time becomes large and inaccurate.

So I skip tracking lag during a transaction, just like the current HEAD.

Attach the new patch.

Regards,
Wang wei
Attachments
On Mon, Apr 11, 2022 at 2:39 PM I wrote:
> Attach the new patch.

Also, sharing test results and details.

To check that the LSN information used for the calculation is what we expected, I got some information by adding logs in the function LagTrackerRead.

Summary of test results:
- In current HEAD and current HEAD with the v14 patch, we could find the entry in lag_tracker for the same LSN as received from the subscriber side.
- In current HEAD with the v13 patch, we could hardly find the entry for the same LSN in lag_tracker.

Attach the details:

[The log by HEAD]
the lsn we received from subscriber | the lsn whose time we used to calculate in lag_tracker
382826584  | 382826584
743884840  | 743884840
1104943232 | 1104943232
1468949424 | 1468949424
1469521216 | 1469521216

[The log by HEAD with v14 patch]
the lsn we received from subscriber | the lsn whose time we used to calculate in lag_tracker
382826584  | 382826584
743890672  | 743890672
1105074264 | 1105074264
1469127040 | 1469127040
1830591240 | 1830591240

[The log by HEAD with v13 patch]
the lsn we received from subscriber | the lsn whose time we used to calculate in lag_tracker
382826584  | 359848728
743884840  | 713808560
1105010640 | 1073978544
1468517536 | 1447850160
1469516328 | 1469516328

Regards,
Wang wei
On Mon, Apr 11, 2022 at 12:09 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > So I skip tracking lag during a transaction just like the current HEAD. > Attach the new patch. > Thanks, please find the updated patch where I have slightly modified the comments. Sawada-San, Euler, do you have any opinion on this approach? I personally still prefer the approach implemented in v10 [1] especially due to the latest finding by Wang-San that we can't update the lag-tracker apart from when it is invoked at the transaction end. However, I am fine if we like this approach more. [1] - https://www.postgresql.org/message-id/OS3PR01MB6275E0C2B4D9E488AD7CBA209E1F9%40OS3PR01MB6275.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
Attachments
On Wed, Apr 13, 2022 at 7:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Apr 11, 2022 at 12:09 PM wangw.fnst@fujitsu.com
> <wangw.fnst@fujitsu.com> wrote:
> >
> > So I skip tracking lag during a transaction just like the current HEAD.
> > Attach the new patch.
> >
>
> Thanks, please find the updated patch where I have slightly modified
> the comments.
>
> Sawada-San, Euler, do you have any opinion on this approach? I
> personally still prefer the approach implemented in v10 [1] especially
> due to the latest finding by Wang-San that we can't update the
> lag-tracker apart from when it is invoked at the transaction end.
> However, I am fine if we like this approach more.

Thank you for updating the patch.

The current patch looks much better than v10, which requires calling update_progress() in every path.

Regarding the v15 patch, I'm a bit concerned that the new function name, update_progress(), is too generic. How about update_replation_progress() or some other more specific name?

---
+	if (end_xact)
+	{
+		/* Update progress tracking at xact end. */
+		OutputPluginUpdateProgress(ctx, skipped_xact, end_xact);
+		changes_count = 0;
+		return;
+	}
+
+	/*
+	 * After continuously processing CHANGES_THRESHOLD changes, we try to send
+	 * a keepalive message if required.
+	 *
+	 * We don't want to try sending a keepalive message after processing each
+	 * change as that can have overhead. Testing reveals that there is no
+	 * noticeable overhead in doing it after continuously processing 100 or so
+	 * changes.
+	 */
+#define CHANGES_THRESHOLD 100
+	if (++changes_count >= CHANGES_THRESHOLD)
+	{
+		OutputPluginUpdateProgress(ctx, skipped_xact, end_xact);
+		changes_count = 0;
+	}

Can we merge the two if branches since we do the same things? Or did you separate them for better readability?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Thu, Apr 14, 2022 at 5:52 PM Euler Taveira <euler@eulerto.com> wrote: > > On Wed, Apr 13, 2022, at 7:45 AM, Amit Kapila wrote: > > Sawada-San, Euler, do you have any opinion on this approach? I > personally still prefer the approach implemented in v10 [1] especially > due to the latest finding by Wang-San that we can't update the > lag-tracker apart from when it is invoked at the transaction end. > However, I am fine if we like this approach more. > > It seems v15 is simpler and less error prone than v10. v10 has a mix of > OutputPluginUpdateProgress() and the new function update_progress(). The v10 > also calls update_progress() for every change action in pgoutput_change(). It > is not a good approach for maintainability -- new changes like sequences need > extra calls. > Okay, let's use the v15 approach as Sawada-San also seems to have a preference for that. > However, as you mentioned there should handle the track lag case. > > Both patches change the OutputPluginUpdateProgress() so it cannot be > backpatched. Are you planning to backpatch it? If so, the boolean variable > (last_write or end_xacts depending of which version you are considering) could > be added to LogicalDecodingContext. > If we add it to LogicalDecodingContext then I think we have to always reset the variable after its use which will make it look ugly and error-prone. I was not thinking to backpatch it because of the API change but I guess if we want to backpatch then we can add it to LogicalDecodingContext for back-branches. I am not sure if that will look committable but surely we can try. > (You should probably consider this approach > for skipped_xact too) > As mentioned, I think it will be more error-prone and we already have other xact related parameters in that and similar APIs. So, I am not sure why you want to prefer that? > > Does this same issue occur for long transactions? I mean keep a long > transaction open and execute thousands of transactions. 
> No, this problem won't happen for such cases because we will only try to send it at the commit time. Note that this problem happens only when we don't send anything to the subscriber till a timeout happens. -- With Regards, Amit Kapila.
On Thu, Apr 14, 2022 at 5:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Apr 13, 2022 at 7:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Apr 11, 2022 at 12:09 PM wangw.fnst@fujitsu.com > > <wangw.fnst@fujitsu.com> wrote: > > > > > > So I skip tracking lag during a transaction just like the current HEAD. > > > Attach the new patch. > > > > > > > Thanks, please find the updated patch where I have slightly modified > > the comments. > > > > Sawada-San, Euler, do you have any opinion on this approach? I > > personally still prefer the approach implemented in v10 [1] especially > > due to the latest finding by Wang-San that we can't update the > > lag-tracker apart from when it is invoked at the transaction end. > > However, I am fine if we like this approach more. > > Thank you for updating the patch. > > The current patch looks much better than v10 which requires to call to > update_progress() every path. > > Regarding v15 patch, I'm concerned a bit that the new function name, > update_progress(), is too generic. How about > update_replation_progress() or something more specific name? > Do you intend to say update_replication_progress()? The word 'replation' doesn't make sense to me. I am fine with this suggestion. > > --- > + if (end_xact) > + { > + /* Update progress tracking at xact end. */ > + OutputPluginUpdateProgress(ctx, skipped_xact, end_xact); > + changes_count = 0; > + return; > + } > + > + /* > + * After continuously processing CHANGES_THRESHOLD changes, > we try to send > + * a keepalive message if required. > + * > + * We don't want to try sending a keepalive message after > processing each > + * change as that can have overhead. Testing reveals that there is no > + * noticeable overhead in doing it after continuously > processing 100 or so > + * changes. 
> + */ > +#define CHANGES_THRESHOLD 100 > + if (++changes_count >= CHANGES_THRESHOLD) > + { > + OutputPluginUpdateProgress(ctx, skipped_xact, end_xact); > + changes_count = 0; > + } > > Can we merge two if branches since we do the same things? Or did you > separate them for better readability? > I think it is fine to merge the two checks. -- With Regards, Amit Kapila.
On Mon, Apr 18, 2022 at 1:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Apr 14, 2022 at 5:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Apr 13, 2022 at 7:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Apr 11, 2022 at 12:09 PM wangw.fnst@fujitsu.com > > > <wangw.fnst@fujitsu.com> wrote: > > > > > > > > So I skip tracking lag during a transaction just like the current HEAD. > > > > Attach the new patch. > > > > > > > > > > Thanks, please find the updated patch where I have slightly modified > > > the comments. > > > > > > Sawada-San, Euler, do you have any opinion on this approach? I > > > personally still prefer the approach implemented in v10 [1] especially > > > due to the latest finding by Wang-San that we can't update the > > > lag-tracker apart from when it is invoked at the transaction end. > > > However, I am fine if we like this approach more. > > > > Thank you for updating the patch. > > > > The current patch looks much better than v10 which requires to call to > > update_progress() every path. > > > > Regarding v15 patch, I'm concerned a bit that the new function name, > > update_progress(), is too generic. How about > > update_replation_progress() or something more specific name? > > > > Do you intend to say update_replication_progress()? The word > 'replation' doesn't make sense to me. I am fine with this suggestion. Yeah, that was a typo. I meant update_replication_progress(). Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Mon, Apr 18, 2022 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Apr 14, 2022 at 5:52 PM Euler Taveira <euler@eulerto.com> wrote: > > > > On Wed, Apr 13, 2022, at 7:45 AM, Amit Kapila wrote: > > > > Sawada-San, Euler, do you have any opinion on this approach? I > > personally still prefer the approach implemented in v10 [1] especially > > due to the latest finding by Wang-San that we can't update the > > lag-tracker apart from when it is invoked at the transaction end. > > However, I am fine if we like this approach more. > > > > It seems v15 is simpler and less error prone than v10. v10 has a mix of > > OutputPluginUpdateProgress() and the new function update_progress(). The v10 > > also calls update_progress() for every change action in pgoutput_change(). It > > is not a good approach for maintainability -- new changes like sequences need > > extra calls. > > > > Okay, let's use the v15 approach as Sawada-San also seems to have a > preference for that. > > > However, as you mentioned there should handle the track lag case. > > > > Both patches change the OutputPluginUpdateProgress() so it cannot be > > backpatched. Are you planning to backpatch it? If so, the boolean variable > > (last_write or end_xacts depending of which version you are considering) could > > be added to LogicalDecodingContext. > > > > If we add it to LogicalDecodingContext then I think we have to always > reset the variable after its use which will make it look ugly and > error-prone. I was not thinking to backpatch it because of the API > change but I guess if we want to backpatch then we can add it to > LogicalDecodingContext for back-branches. I am not sure if that will > look committable but surely we can try. > Even, if we want to add the variable in the struct in back-branches, we need to ensure not to change the size of the struct as it is exposed, see email [1] for a similar mistake we made in another case. 
[1] - https://www.postgresql.org/message-id/2358496.1649168259%40sss.pgh.pa.us -- With Regards, Amit Kapila.
On Mon, Apr 18, 2022 at 00:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Mon, Apr 18, 2022 at 1:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Apr 14, 2022 at 5:50 PM Masahiko Sawada <sawada.mshk@gmail.com> > wrote: > > > > > > On Wed, Apr 13, 2022 at 7:45 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > > > > > On Mon, Apr 11, 2022 at 12:09 PM wangw.fnst@fujitsu.com > > > > <wangw.fnst@fujitsu.com> wrote: > > > > > > > > > > So I skip tracking lag during a transaction just like the current HEAD. > > > > > Attach the new patch. > > > > > > > > > > > > > Thanks, please find the updated patch where I have slightly > > > > modified the comments. > > > > > > > > Sawada-San, Euler, do you have any opinion on this approach? I > > > > personally still prefer the approach implemented in v10 [1] > > > > especially due to the latest finding by Wang-San that we can't > > > > update the lag-tracker apart from when it is invoked at the transaction end. > > > > However, I am fine if we like this approach more. > > > > > > Thank you for updating the patch. > > > > > > The current patch looks much better than v10 which requires to call > > > to > > > update_progress() every path. > > > > > > Regarding v15 patch, I'm concerned a bit that the new function name, > > > update_progress(), is too generic. How about > > > update_replation_progress() or something more specific name? > > > > > > > Do you intend to say update_replication_progress()? The word > > 'replation' doesn't make sense to me. I am fine with this suggestion. > > Yeah, that was a typo. I meant update_replication_progress(). Thanks for your comments. > > > Regarding v15 patch, I'm concerned a bit that the new function name, > > > update_progress(), is too generic. How about > > > update_replation_progress() or something more specific name? Improve as suggested. Change the name from update_progress to update_replication_progress. 
> > > --- > > > + if (end_xact) > > > + { > > > + /* Update progress tracking at xact end. */ > > > + OutputPluginUpdateProgress(ctx, skipped_xact, end_xact); > > > + changes_count = 0; > > > + return; > > > + } > > > + > > > + /* > > > + * After continuously processing CHANGES_THRESHOLD changes, > > > we try to send > > > + * a keepalive message if required. > > > + * > > > + * We don't want to try sending a keepalive message after > > > processing each > > > + * change as that can have overhead. Testing reveals that there is no > > > + * noticeable overhead in doing it after continuously > > > processing 100 or so > > > + * changes. > > > + */ > > > +#define CHANGES_THRESHOLD 100 > > > + if (++changes_count >= CHANGES_THRESHOLD) > > > + { > > > + OutputPluginUpdateProgress(ctx, skipped_xact, end_xact); > > > + changes_count = 0; > > > + } > > > > > > Can we merge two if branches since we do the same things? Or did you > > > separate them for better readability? Improve as suggested. Merge two if-branches. Attach the new patch. 1. Rename the new function(update_progress) to update_replication_progress. [suggestion by Sawada-San] 2. Merge two if-branches in new function update_replication_progress. [suggestion by Sawada-San.] 3. Improve comments to make them clear. [suggestions by Euler-San.] Regards, Wang wei
Attachments
On Thu, Apr 14, 2022 at 8:21 PM Euler Taveira <euler@eulerto.com> wrote: > Thanks for your comments. > + * For a large transaction, if we don't send any change to the downstream for a > + * long time then it can timeout. This can happen when all or most of the > + * changes are either not published or got filtered out. > > We should probable mention that "long time" is wal_receiver_timeout on > subscriber. Improve as suggested. Add "(exceeds the wal_receiver_timeout of standby)" to explain what "long time" means. > + * change as that can have overhead. Testing reveals that there is no > + * noticeable overhead in doing it after continuously processing 100 or so > + * changes. > > Tests revealed that ... Improve as suggested. > + * We don't have a mechanism to get the ack for any LSN other than end xact > + * lsn from the downstream. So, we track lag only for end xact lsn's. > > s/lsn/LSN/ and s/lsn's/LSNs/ > > I would say "end of transaction LSN". Improve as suggested. > + * If too many changes are processed then try to send a keepalive message to > + * receiver to avoid timeouts. > > In logical replication, if too many changes are processed then try to send a > keepalive message. It might avoid a timeout in the subscriber. Improve as suggested. Kindly have a look at new patch shared in [1]. [1] - https://www.postgresql.org/message-id/OS3PR01MB627561344A2C7ECF68E41D7E9EF39%40OS3PR01MB6275.jpnprd01.prod.outlook.com Regards, Wang wei
On Mon, Apr 18, 2022 at 3:16 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > On Mon, Apr 18, 2022 at 00:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Apr 18, 2022 at 1:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Thu, Apr 14, 2022 at 5:50 PM Masahiko Sawada <sawada.mshk@gmail.com> > > wrote: > > > > > > > > On Wed, Apr 13, 2022 at 7:45 PM Amit Kapila <amit.kapila16@gmail.com> > > wrote: > > > > > > > > > > On Mon, Apr 11, 2022 at 12:09 PM wangw.fnst@fujitsu.com > > > > > <wangw.fnst@fujitsu.com> wrote: > > > > > > > > > > > > So I skip tracking lag during a transaction just like the current HEAD. > > > > > > Attach the new patch. > > > > > > > > > > > > > > > > Thanks, please find the updated patch where I have slightly > > > > > modified the comments. > > > > > > > > > > Sawada-San, Euler, do you have any opinion on this approach? I > > > > > personally still prefer the approach implemented in v10 [1] > > > > > especially due to the latest finding by Wang-San that we can't > > > > > update the lag-tracker apart from when it is invoked at the transaction end. > > > > > However, I am fine if we like this approach more. > > > > > > > > Thank you for updating the patch. > > > > > > > > The current patch looks much better than v10 which requires to call > > > > to > > > > update_progress() every path. > > > > > > > > Regarding v15 patch, I'm concerned a bit that the new function name, > > > > update_progress(), is too generic. How about > > > > update_replation_progress() or something more specific name? > > > > > > > > > > Do you intend to say update_replication_progress()? The word > > > 'replation' doesn't make sense to me. I am fine with this suggestion. > > > > Yeah, that was a typo. I meant update_replication_progress(). > Thanks for your comments. > > > > > Regarding v15 patch, I'm concerned a bit that the new function name, > > > > update_progress(), is too generic. 
How about > > > > update_replation_progress() or something more specific name? > Improve as suggested. Change the name from update_progress to > update_replication_progress. > > > > > --- > > > > + if (end_xact) > > > > + { > > > > + /* Update progress tracking at xact end. */ > > > > + OutputPluginUpdateProgress(ctx, skipped_xact, end_xact); > > > > + changes_count = 0; > > > > + return; > > > > + } > > > > + > > > > + /* > > > > + * After continuously processing CHANGES_THRESHOLD changes, > > > > we try to send > > > > + * a keepalive message if required. > > > > + * > > > > + * We don't want to try sending a keepalive message after > > > > processing each > > > > + * change as that can have overhead. Testing reveals that there is no > > > > + * noticeable overhead in doing it after continuously > > > > processing 100 or so > > > > + * changes. > > > > + */ > > > > +#define CHANGES_THRESHOLD 100 > > > > + if (++changes_count >= CHANGES_THRESHOLD) > > > > + { > > > > + OutputPluginUpdateProgress(ctx, skipped_xact, end_xact); > > > > + changes_count = 0; > > > > + } > > > > > > > > Can we merge two if branches since we do the same things? Or did you > > > > separate them for better readability? > Improve as suggested. Merge two if-branches. > > Attach the new patch. > 1. Rename the new function(update_progress) to update_replication_progress. [suggestion by Sawada-San] > 2. Merge two if-branches in new function update_replication_progress. [suggestion by Sawada-San.] > 3. Improve comments to make them clear. [suggestions by Euler-San.] Thank you for updating the patch. + * For a large transaction, if we don't send any change to the downstream for a + * long time(exceeds the wal_receiver_timeout of standby) then it can timeout. + * This can happen when all or most of the changes are either not published or + * got filtered out. + */ + if(end_xact || ++changes_count >= CHANGES_THRESHOLD) + { We need a whitespace before '(' at above two places. 
The rest looks good to me. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Mon, Apr 19, 2022 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Thank you for updating the patch. Thanks for your comments. > + * For a large transaction, if we don't send any change to the > + downstream for a > + * long time(exceeds the wal_receiver_timeout of standby) then it can > timeout. > + * This can happen when all or most of the changes are either not > + published or > + * got filtered out. > > + */ > + if(end_xact || ++changes_count >= CHANGES_THRESHOLD) { > > We need a whitespace before '(' at above two places. The rest looks good to me. Fix these. Attach the new patch. 1. Fix wrong formatting. [suggestion by Sawada-San] Regards, Wang wei
Attachments
On Mon, Apr 18, 2022 at 00:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Apr 18, 2022 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Apr 14, 2022 at 5:52 PM Euler Taveira <euler@eulerto.com> wrote: > > > > > > On Wed, Apr 13, 2022, at 7:45 AM, Amit Kapila wrote: > > > > > > Sawada-San, Euler, do you have any opinion on this approach? I > > > personally still prefer the approach implemented in v10 [1] > > > especially due to the latest finding by Wang-San that we can't > > > update the lag-tracker apart from when it is invoked at the transaction end. > > > However, I am fine if we like this approach more. > > > > > > It seems v15 is simpler and less error prone than v10. v10 has a mix > > > of > > > OutputPluginUpdateProgress() and the new function update_progress(). > > > The v10 also calls update_progress() for every change action in > > > pgoutput_change(). It is not a good approach for maintainability -- > > > new changes like sequences need extra calls. > > > > > > > Okay, let's use the v15 approach as Sawada-San also seems to have a > > preference for that. > > > > > However, as you mentioned there should handle the track lag case. > > > > > > Both patches change the OutputPluginUpdateProgress() so it cannot be > > > backpatched. Are you planning to backpatch it? If so, the boolean > > > variable (last_write or end_xacts depending of which version you are > > > considering) could be added to LogicalDecodingContext. > > > > > > > If we add it to LogicalDecodingContext then I think we have to always > > reset the variable after its use which will make it look ugly and > > error-prone. I was not thinking to backpatch it because of the API > > change but I guess if we want to backpatch then we can add it to > > LogicalDecodingContext for back-branches. I am not sure if that will > > look committable but surely we can try. 
> > > > Even, if we want to add the variable in the struct in back-branches, we need to > ensure not to change the size of the struct as it is exposed, see email [1] for a > similar mistake we made in another case. > > [1] - https://www.postgresql.org/message- > id/2358496.1649168259%40sss.pgh.pa.us Thanks for your comments. I did some checks about adding the new variable in LogicalDecodingContext. I found that because of padding, if we add the new variable at the end of the structure, it does not change the structure size. I verified this in REL_10~REL_14. So as suggested by Euler-San and Amit-San, I wrote the patch for REL_14. Attach this patch. To prevent patch confusion, the patch for HEAD is also attached. The patch for REL_14: REL_14_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch The patch for HEAD: v17-0001-Fix-the-logical-replication-timeout-during-large.patch The following are the details of the checks. On gcc/Linux/x86-64, in REL_14, by looking at the size of each member variable in the structure LogicalDecodingContext, I found that there is padding in three places, after the following member variables: - 7 bytes after fast_forward - 4 bytes after prepared_write - 4 bytes after write_xid If we add the new variable at the end of the structure (bool takes one byte), it means we will only consume one byte of the padding after member write_xid. And then, at the end of the struct, 3 bytes of padding are still required. For easy understanding, please refer to the following simple calculation. (In REL14, the size of structure LogicalDecodingContext is 304 bytes.) Before adding the new variable (In REL14): 8+8+8+8+8+1+168+8+8+8+8+8+8+8+8+1+1+1+1+8+4 = 289 (if padding is not considered) +7 +4 +4 = +15 (the padding) So, the size of structure LogicalDecodingContext is 289+15=304.
After adding new variable (In REL14 with patch): 8+8+8+8+8+1+168+8+8+8+8+8+8+8+8+1+1+1+1+8+4+1 = 290 (if padding is not considered) +7 +4 +3 = +14 (the padding) So, the size of structure LogicalDecodingContext is 290+14=304. BTW, the size of structure LogicalDecodingContext in REL_10~REL_13 is 184, 200, 200,200 respectively. And I found that at the end of the structure LogicalDecodingContext, there are always the following members: ``` XLogRecPtr write_location; --> 8 TransactionId write_xid; --> 4 --> There are 4 padding after write_xid. ``` It means at the end of structure LogicalDecodingContext, there are 4 bytes padding. So, if we add a new bool type variable (It takes one byte) at the end of the structure LogicalDecodingContext, I think in the current REL_10~REL_14, because of padding, the size of the structure LogicalDecodingContext will not change. Regards, Wang wei
Attachments
On Wed, Apr 20, 2022 at 11:46 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > On Mon, Apr 18, 2022 at 00:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Apr 18, 2022 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Thu, Apr 14, 2022 at 5:52 PM Euler Taveira <euler@eulerto.com> wrote: > > > > > > > > On Wed, Apr 13, 2022, at 7:45 AM, Amit Kapila wrote: > > > > > > > > Sawada-San, Euler, do you have any opinion on this approach? I > > > > personally still prefer the approach implemented in v10 [1] > > > > especially due to the latest finding by Wang-San that we can't > > > > update the lag-tracker apart from when it is invoked at the transaction end. > > > > However, I am fine if we like this approach more. > > > > > > > > It seems v15 is simpler and less error prone than v10. v10 has a mix > > > > of > > > > OutputPluginUpdateProgress() and the new function update_progress(). > > > > The v10 also calls update_progress() for every change action in > > > > pgoutput_change(). It is not a good approach for maintainability -- > > > > new changes like sequences need extra calls. > > > > > > > > > > Okay, let's use the v15 approach as Sawada-San also seems to have a > > > preference for that. > > > > > > > However, as you mentioned there should handle the track lag case. > > > > > > > > Both patches change the OutputPluginUpdateProgress() so it cannot be > > > > backpatched. Are you planning to backpatch it? If so, the boolean > > > > variable (last_write or end_xacts depending of which version you are > > > > considering) could be added to LogicalDecodingContext. > > > > > > > > > > If we add it to LogicalDecodingContext then I think we have to always > > > reset the variable after its use which will make it look ugly and > > > error-prone. I was not thinking to backpatch it because of the API > > > change but I guess if we want to backpatch then we can add it to > > > LogicalDecodingContext for back-branches. 
I am not sure if that will > > > look committable but surely we can try. > > > > > > > Even, if we want to add the variable in the struct in back-branches, we need to > > ensure not to change the size of the struct as it is exposed, see email [1] for a > > similar mistake we made in another case. > > > > [1] - https://www.postgresql.org/message- > > id/2358496.1649168259%40sss.pgh.pa.us > Thanks for your comments. > > I did some checks about adding the new variable in LogicalDecodingContext. > I found that because of padding, if we add the new variable at the end of > structure, it dose not make the structure size change. I verified this in > REL_10~REL_14. > > So as suggested by Euler-San and Amit-San, I wrote the patch for REL_14. Attach > this patch. To prevent patch confusion, the patch for HEAD is also attached. > The patch for REL_14: > REL_14_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch > The patch for HEAD: > v17-0001-Fix-the-logical-replication-timeout-during-large.patch > > The following is the details of checks. > On gcc/Linux/x86-64, in REL_14, by looking at the size of each member variable > in the structure LogicalDecodingContext, I found that there are three parts > padding behind the following member variables: > - 7 bytes after fast_forward > - 4 bytes after prepared_write > - 4 bytes after write_xid > > If we add the new variable at the end of structure (bool takes one byte), it > means we will only consume one byte of padding after member write_xid. And > then, at the end of the struct, 3 padding are still required. For easy > understanding, please refer to the following simple calculation. > (In REL14, the size of structure LogicalDecodingContext is 304 bytes.) > Before adding new variable (In REL14): > 8+8+8+8+8+1+168+8+8+8+8+8+8+8+8+1+1+1+1+8+4 = 289 (if padding is not considered) > +7 +4 +4 = +15 (the padding) > So, the size of structure LogicalDecodingContext is 289+15=304. 
> After adding new variable (In REL14 with patch): > 8+8+8+8+8+1+168+8+8+8+8+8+8+8+8+1+1+1+1+8+4+1 = 290 (if padding is not considered) > +7 +4 +3 = +14 (the padding) > So, the size of structure LogicalDecodingContext is 290+14=304. > > BTW, the size of structure LogicalDecodingContext in REL_10~REL_13 is 184, 200, > 200,200 respectively. And I found that at the end of the structure > LogicalDecodingContext, there are always the following members: > ``` > XLogRecPtr write_location; --> 8 > TransactionId write_xid; --> 4 > --> There are 4 padding after write_xid. > ``` I'm concerned that this 4-byte padding at the end of the struct could depend on platforms (there might be no padding in 32-bit platforms?). It seems to me that it's better to put it after fast_forward where the new field should fall within the padding space. BTW the changes in REL_14_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch, adding end_xact to LogicalDecodingContext, seems good to me and it might be better than the approach of v17 patch from plugin developers’ perspective? This is because they won’t need to pass true/false to end_xact of OutputPluginUpdateProgress(). Furthermore, if we do what we do in update_replication_progress() in OutputPluginUpdateProgress(), what plugins need to do will be just to call OutputPluginUpdate() in every callback and they don't need to have the CHANGES_THRESHOLD logic. What do you think? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Wed, Apr 20, 2022 at 12:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Apr 20, 2022 at 11:46 AM wangw.fnst@fujitsu.com > <wangw.fnst@fujitsu.com> wrote: > > ``` > > I'm concerned that this 4-byte padding at the end of the struct could > depend on platforms (there might be no padding in 32-bit platforms?). > Good point, but ... > It seems to me that it's better to put it after fast_forward where the > new field should fall within the padding space. > Can we add the variable in between the existing variables in the structure in the back branches? Normally, we add at the end to avoid any breakage of existing apps. See commit 56e366f675 and discussion at [1]. That is related to enum but I think we follow the same for structures. [1] - https://www.postgresql.org/message-id/7dab0929-a966-0c0a-4726-878fced2fe00%40enterprisedb.com -- With Regards, Amit Kapila.
On Wed, Apr 20, 2022 at 2:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Apr 20, 2022 at 12:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Apr 20, 2022 at 11:46 AM wangw.fnst@fujitsu.com > > <wangw.fnst@fujitsu.com> wrote: > > > ``` > > > > I'm concerned that this 4-byte padding at the end of the struct could > > depend on platforms (there might be no padding in 32-bit platforms?). > > > > Good point, but ... > > > It seems to me that it's better to put it after fast_forward where the > > new field should fall within the padding space. > > > > Can we add the variable in between the existing variables in the > structure in the back branches? > I think it should be fine if it falls in the padding space. We have done similar changes recently in back-branches [1]. I think it would be then better to have it in the same place in HEAD as well? [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=10520f4346876aad4941797c2255a21bdac74739 -- With Regards, Amit Kapila.
On Wed, Apr 20, 2022 at 7:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Apr 20, 2022 at 2:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Apr 20, 2022 at 12:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Wed, Apr 20, 2022 at 11:46 AM wangw.fnst@fujitsu.com > > > <wangw.fnst@fujitsu.com> wrote: > > > > ``` > > > > > > I'm concerned that this 4-byte padding at the end of the struct could > > > depend on platforms (there might be no padding in 32-bit platforms?). > > > > > > > Good point, but ... > > > > > It seems to me that it's better to put it after fast_forward where the > > > new field should fall within the padding space. > > > > > > > Can we add the variable in between the existing variables in the > > structure in the back branches? > > > > I think it should be fine if it falls in the padding space. We have > done similar changes recently in back-branches [1]. Yes. > I think it would > be then better to have it in the same place in HEAD as well? As far as I can see in the v17 patch, which is for HEAD, we don't add a variable to LogicalDecodingContext, but did you refer to another patch? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Wed, Apr 20, 2022 at 6:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Apr 20, 2022 at 2:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Apr 20, 2022 at 12:51 PM Masahiko Sawada > <sawada.mshk@gmail.com> wrote: > > > > > > On Wed, Apr 20, 2022 at 11:46 AM wangw.fnst@fujitsu.com > > > <wangw.fnst@fujitsu.com> wrote: > > > > ``` > > > > > > I'm concerned that this 4-byte padding at the end of the struct could > > > depend on platforms (there might be no padding in 32-bit platforms?). > > > > > > > Good point, but ... > > > > > It seems to me that it's better to put it after fast_forward where the > > > new field should fall within the padding space. > > > > > > > Can we add the variable in between the existing variables in the > > structure in the back branches? > > > > I think it should be fine if it falls in the padding space. We have > done similar changes recently in back-branches [1]. I think it would > be then better to have it in the same place in HEAD as well? > > [1] - > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=10520f4346 > 876aad4941797c2255a21bdac74739 Thanks for your comments. The comments by Sawada-San sound reasonable to me. After doing check, I found that padding in HEAD is the same as in REL14. So I change the approach of patch for HEAD just like the patch for REL14. On Wed, Apr 20, 2022 at 3:21 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I'm concerned that this 4-byte padding at the end of the struct could > depend on platforms (there might be no padding in 32-bit platforms?). > It seems to me that it's better to put it after fast_forward where the > new field should fall within the padding space. Fixed. Add new variable after fast_forward. 
> BTW the changes in > REL_14_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch, > adding end_xact to LogicalDecodingContext, seems good to me and it > might be better than the approach of v17 patch from plugin developers’ > perspective? This is because they won’t need to pass true/false to > end_xact of OutputPluginUpdateProgress(). Furthermore, if we do what > we do in update_replication_progress() in > OutputPluginUpdateProgress(), what plugins need to do will be just to > call OutputPluginUpdate() in every callback and they don't need to > have the CHANGES_THRESHOLD logic. What do you think? Change the approach of patch for HEAD. (The size of structure does not change.) Also move the logical of function update_replication_progress to function OutputPluginUpdateProgress. Attach the patches. [suggestion by Sawada-San] 1. Change the position of the new variable in structure. 2. Change the approach of the patch for HEAD. 3. Delete the new function update_replication_progress. Regards, Wang wei
Attachments
On Wed, Apr 20, 2022 at 6:22 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Apr 20, 2022 at 7:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > I think it would > > be then better to have it in the same place in HEAD as well? > > As far as I can see in the v17 patch, which is for HEAD, we don't add > a variable to LogicalDecodingContext, but did you refer to another > patch? > No, I thought it is better to follow the same approach in HEAD as well. Do you see any problem with it? -- With Regards, Amit Kapila.
On Thu, Apr 21, 2022 at 11:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Apr 20, 2022 at 6:22 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Apr 20, 2022 at 7:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > I think it would > > > be then better to have it in the same place in HEAD as well? > > > > As far as I can see in the v17 patch, which is for HEAD, we don't add > > a variable to LogicalDecodingContext, but did you refer to another > > patch? > > > > No, I thought it is better to follow the same approach in HEAD as > well. Do you see any problem with it? No, that makes sense to me. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Wed, Apr 21, 2022 at 10:15 AM I wrote: > The comments by Sawada-San sound reasonable to me. > After doing check, I found that padding in HEAD is the same as in REL14. > So I change the approach of patch for HEAD just like the patch for REL14. Also attach the back-branch patches for REL10~REL13. (REL12 and REL11 patch are the same, so only post one patch for these two branches.) The patch for HEAD: HEAD_v18-0001-Fix-the-logical-replication-timeout-during-large.patch The patch for REL14: REL14_v2-0001-Fix-the-logical-replication-timeout-during-large-.patch The patch for REL13: REL13_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch The patch for REL12 and REL11: REL12-REL11_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch The patch for REL10: REL10_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch BTW, after doing check, I found that padding in REL11~REL13 are similar as HEAD and REL14 (7 bytes padding after fast_forward). But in REL10, the padding is different. There are three parts padding behind the following member variables: - 4 bytes after options - 6 bytes after prepared_write - 4 bytes after write_xid So, in the patches for branches REL11~HEAD, I add the new variable after fast_forward. In the patch for branch REL10, I add the new variable after prepared_write. For each version, the size of the structure does not change after applying the patch. Regards, Wang wei
Attachments
- HEAD_v18-0001-Fix-the-logical-replication-timeout-during-large.patch
- REL14_v2-0001-Fix-the-logical-replication-timeout-during-large-.patch
- REL13_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch
- REL12-REL11_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch
- REL10_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch
On Wednesday, April 20, 2022 3:21 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > BTW the changes in > REL_14_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch, > adding end_xact to LogicalDecodingContext, seems good to me and it > might be better than the approach of v17 patch from plugin developers’ > perspective? This is because they won’t need to pass true/false to > end_xact of OutputPluginUpdateProgress(). Furthermore, if we do what > we do in update_replication_progress() in > OutputPluginUpdateProgress(), what plugins need to do will be just to > call OutputPluginUpdate() in every callback and they don't need to > have the CHANGES_THRESHOLD logic. What do you think? Hi Sawada-san, Wang I was looking at the patch and noticed that we moved some logic from update_replication_progress() to OutputPluginUpdateProgress() as you suggested. I have a question about this change. In the patch we added a restriction in the function OutputPluginUpdateProgress(), like below: + /* + * If we are at the end of transaction LSN, update progress tracking. + * Otherwise, after continuously processing CHANGES_THRESHOLD changes, we + * try to send a keepalive message if required. + */ + if (ctx->end_xact || ++changes_count >= CHANGES_THRESHOLD) + { + ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, + skipped_xact); + changes_count = 0; + } After the patch, we won't always invoke update_progress() if the caller is in the middle of a transaction (e.g. end_xact = false). The behavior of the public function OutputPluginUpdateProgress() is changed. I am wondering whether it is OK to change this in back-branches. Because OutputPluginUpdateProgress() is a public function for extension developers, I am just a little concerned about the behavior change here. Besides, the check of 'end_xact' and the 'CHANGES_THRESHOLD' seems specific to pgoutput.
I am not very sure that if plugin author also needs these logic(they might want to change the strategy), so I'd like to confirm it with you. Best regards, Hou zj
On Thu, Apr 21, 2022 at 3:21 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > I think it is better to keep the new variable 'end_xact' at the end of the struct where it belongs for HEAD. In back branches, we can keep it at the place as you have. Apart from that, I have made some cosmetic changes and changed a few comments in the attached. Let's use this to prepare patches for back-branches. -- With Regards, Amit Kapila.
Attachments
On Thu, Apr 28, 2022 at 6:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Apr 21, 2022 at 3:21 PM wangw.fnst@fujitsu.com > <wangw.fnst@fujitsu.com> wrote: > > > > I think it is better to keep the new variable 'end_xact' at the end of > the struct where it belongs for HEAD. In back branches, we can keep it > at the place as you have. Apart from that, I have made some cosmetic > changes and changed a few comments in the attached. Let's use this to > prepare patches for back-branches. Thanks for your review and improvement. I improved the back-branch patches according to your modifications. Attach the back-branch patches for REL10~REL14. (Also attach the patch for HEAD, I did not make any changes to this patch.) BTW, I found Hou-san shared some points. After our discussion, I will update the patches if required. Regards, Wang wei
Attachments
- HEAD_v19-0001-Fix-the-logical-replication-timeout-during-large.patch
- REL14_v3-0001-Fix-the-logical-replication-timeout-during-large-.patch
- REL13_v2-0001-Fix-the-logical-replication-timeout-during-large-.patch
- REL12-REL11_v2-0001-Fix-the-logical-replication-timeout-during-large-.patch
- REL10_v2-0001-Fix-the-logical-replication-timeout-during-large-.patch
On Thu, Apr 28, 2022 at 7:01 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, April 20, 2022 3:21 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > BTW the changes in
> > REL_14_v1-0001-Fix-the-logical-replication-timeout-during-large-.patch,
> > adding end_xact to LogicalDecodingContext, seems good to me and it
> > might be better than the approach of the v17 patch from plugin developers'
> > perspective? This is because they won't need to pass true/false to
> > end_xact of OutputPluginUpdateProgress(). Furthermore, if we do what
> > we do in update_replication_progress() in
> > OutputPluginUpdateProgress(), what plugins need to do will be just to
> > call OutputPluginUpdate() in every callback and they don't need to
> > have the CHANGES_THRESHOLD logic. What do you think?
>
> Hi Sawada-san, Wang
>
> I was looking at the patch and noticed that we moved some logic from
> update_replication_progress() to OutputPluginUpdateProgress() like
> your suggestion.
>
> I have a question about this change. In the patch we added some
> restrictions to the function OutputPluginUpdateProgress(), like below:
>
> + /*
> + * If we are at the end of transaction LSN, update progress tracking.
> + * Otherwise, after continuously processing CHANGES_THRESHOLD changes, we
> + * try to send a keepalive message if required.
> + */
> + if (ctx->end_xact || ++changes_count >= CHANGES_THRESHOLD)
> + {
> + ctx->update_progress(ctx, ctx->write_location, ctx->write_xid,
> + skipped_xact);
> + changes_count = 0;
> + }
>
> After the patch, we won't be able to always invoke update_progress() if the
> caller is in the middle of a transaction (e.g. end_xact = false). The behavior of the
> public function OutputPluginUpdateProgress() is changed. I am wondering whether
> it is OK to change this in back-branches?
>
> Because OutputPluginUpdateProgress() is a public function for extension
> developers, I am just a little concerned about the behavior change here.

Good point.

As you pointed out, it would not be good if we change the behavior of
OutputPluginUpdateProgress() in back branches. Also, after more thought, it is
not a good idea even for HEAD since there might be background workers that use
logical decoding and the timeout checking might not be relevant at all for them.

BTW, I think you're concerned about the plugins that call
OutputPluginUpdateProgress() in the middle of the transaction (i.e., end_xact =
false). We have the following change that makes the walsender not update the
progress in the middle of the transaction. Do you think it is okay, since it's
not common usage to call OutputPluginUpdateProgress() in the middle of the
transaction by a plugin that is used by the walsender?

#define WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS 1000

-    if (!TimestampDifferenceExceeds(sendTime, now,
+    if (end_xact && TimestampDifferenceExceeds(sendTime, now,
                                     WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS))
-        return;
+    {
+        LagTrackerWrite(lsn, now);
+        sendTime = now;
+    }

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, May 2, 2022 at 7:33 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Thu, Apr 28, 2022 at 7:01 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> >
> > Hi Sawada-san, Wang
> >
> > I was looking at the patch and noticed that we moved some logic from
> > update_replication_progress() to OutputPluginUpdateProgress() like
> > your suggestion.
> >
> > I have a question about this change. In the patch we added some
> > restrictions to the function OutputPluginUpdateProgress(), like below:
> >
> > + /*
> > + * If we are at the end of transaction LSN, update progress tracking.
> > + * Otherwise, after continuously processing CHANGES_THRESHOLD changes, we
> > + * try to send a keepalive message if required.
> > + */
> > + if (ctx->end_xact || ++changes_count >= CHANGES_THRESHOLD)
> > + {
> > + ctx->update_progress(ctx, ctx->write_location, ctx->write_xid,
> > + skipped_xact);
> > + changes_count = 0;
> > + }
> >
> > After the patch, we won't be able to always invoke update_progress() if the
> > caller is in the middle of a transaction (e.g. end_xact = false). The behavior
> > of the public function OutputPluginUpdateProgress() is changed. I am wondering
> > whether it is OK to change this in back-branches?
> >
> > Because OutputPluginUpdateProgress() is a public function for extension
> > developers, I am just a little concerned about the behavior change here.
>
> Good point.
>
> As you pointed out, it would not be good if we change the behavior of
> OutputPluginUpdateProgress() in back branches. Also, after more thought, it is
> not a good idea even for HEAD since there might be background workers that use
> logical decoding and the timeout checking might not be relevant at all for them.
>

So, shall we go back to the previous approach of using a separate function update_replication_progress?

> BTW, I think you're concerned about the plugins that call
> OutputPluginUpdateProgress() in the middle of the transaction (i.e.,
> end_xact = false). We have the following change that makes the walsender not
> update the progress in the middle of the transaction. Do you think it is okay,
> since it's not common usage to call OutputPluginUpdateProgress() in the middle
> of the transaction by a plugin that is used by the walsender?
>

We have done that purposefully as otherwise the lag tracker shows incorrect information. See email [1]. The reason is that we always get an ack from subscribers for the transaction end. Also, prior to this patch we never called the lag tracker recording apart from at the transaction end, so as a bug fix we shouldn't try to change it.

[1] - https://www.postgresql.org/message-id/OS3PR01MB62755D216245199554DDC8DB9EEA9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.
On Mon, May 2, 2022 at 11:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 2, 2022 at 7:33 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Thu, Apr 28, 2022 at 7:01 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Hi Sawada-san, Wang
> > >
> > > I was looking at the patch and noticed that we moved some logic from
> > > update_replication_progress() to OutputPluginUpdateProgress() like
> > > your suggestion.
> > >
> > > I have a question about this change. In the patch we added some
> > > restrictions to the function OutputPluginUpdateProgress(), like below:
> > >
> > > + /*
> > > + * If we are at the end of transaction LSN, update progress tracking.
> > > + * Otherwise, after continuously processing CHANGES_THRESHOLD changes, we
> > > + * try to send a keepalive message if required.
> > > + */
> > > + if (ctx->end_xact || ++changes_count >= CHANGES_THRESHOLD)
> > > + {
> > > + ctx->update_progress(ctx, ctx->write_location, ctx->write_xid,
> > > + skipped_xact);
> > > + changes_count = 0;
> > > + }
> > >
> > > After the patch, we won't be able to always invoke update_progress() if the
> > > caller is in the middle of a transaction (e.g. end_xact = false). The behavior
> > > of the public function OutputPluginUpdateProgress() is changed. I am wondering
> > > whether it is OK to change this in back-branches?
> > >
> > > Because OutputPluginUpdateProgress() is a public function for extension
> > > developers, I am just a little concerned about the behavior change here.
> >
> > Good point.
> >
> > As you pointed out, it would not be good if we change the behavior of
> > OutputPluginUpdateProgress() in back branches. Also, after more thought,
> > it is not a good idea even for HEAD since there might be background
> > workers that use logical decoding and the timeout checking might not be
> > relevant at all for them.
>
> So, shall we go back to the previous approach of using a separate
> function update_replication_progress?

Ok, agreed.

> > BTW, I think you're concerned about the plugins that call
> > OutputPluginUpdateProgress() in the middle of the transaction (i.e.,
> > end_xact = false). We have the following change that makes the walsender
> > not update the progress in the middle of the transaction. Do you think it
> > is okay, since it's not common usage to call OutputPluginUpdateProgress()
> > in the middle of the transaction by a plugin that is used by the walsender?
>
> We have done that purposefully as otherwise the lag tracker shows
> incorrect information. See email [1]. The reason is that we always get
> an ack from subscribers for the transaction end. Also, prior to this patch
> we never called the lag tracker recording apart from at the transaction end,
> so as a bug fix we shouldn't try to change it.

Makes sense.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, May 2, 2022 at 8:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, May 2, 2022 at 11:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > So, shall we go back to the previous approach of using a separate
> > function update_replication_progress?
>
> Ok, agreed.
>

Attached, please find the updated patch accordingly. Currently, I have prepared it for HEAD; if you don't see any problem with this, we can prepare the back-branch patches based on it.

--
With Regards,
Amit Kapila.
Attachments
On Wed, May 4, 2022 at 7:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 2, 2022 at 8:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, May 2, 2022 at 11:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > So, shall we go back to the previous approach of using a separate
> > > function update_replication_progress?
> >
> > Ok, agreed.
> >
>
> Attached, please find the updated patch accordingly. Currently, I have
> prepared it for HEAD; if you don't see any problem with this, we can
> prepare the back-branch patches based on it.

Thank you for updating the patch. Looks good to me.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Fri, May 6, 2022 at 9:54 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, May 4, 2022 at 7:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, May 2, 2022 at 8:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, May 2, 2022 at 11:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > So, shall we go back to the previous approach of using a separate
> > > > function update_replication_progress?
> > >
> > > Ok, agreed.
> > >
> >
> > Attached, please find the updated patch accordingly. Currently, I have
> > prepared it for HEAD; if you don't see any problem with this, we can
> > prepare the back-branch patches based on it.
>
> Thank you for updating the patch. Looks good to me.

Thanks for your review.

Improved the back-branch patches according to the discussion: moved the CHANGES_THRESHOLD logic from the function OutputPluginUpdateProgress to the new function update_replication_progress. In addition, improved the formatting of all patches with pgindent.

Attached the patches.

Regards,
Wang wei
Attachments
- HEAD_v21-0001-Fix-the-logical-replication-timeout-during-large.patch
- REL14_v4-0001-Fix-the-logical-replication-timeout-during-large-.patch
- REL13_v3-0001-Fix-the-logical-replication-timeout-during-large-.patch
- REL12_v3-0001-Fix-the-logical-replication-timeout-during-large-.patch
- REL11_v3-0001-Fix-the-logical-replication-timeout-during-large-.patch
- REL10_v3-0001-Fix-the-logical-replication-timeout-during-large-.patch
On Fri, May 6, 2022 at 12:42 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
>
> On Fri, May 6, 2022 at 9:54 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Wed, May 4, 2022 at 7:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, May 2, 2022 at 8:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Mon, May 2, 2022 at 11:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > So, shall we go back to the previous approach of using a separate
> > > > > function update_replication_progress?
> > > >
> > > > Ok, agreed.
> > > >
> > >
> > > Attached, please find the updated patch accordingly. Currently, I have
> > > prepared it for HEAD; if you don't see any problem with this, we can
> > > prepare the back-branch patches based on it.
> >
> > Thank you for updating the patch. Looks good to me.
> Thanks for your review.
>
> Improved the back-branch patches according to the discussion.
>

Thanks. The patch LGTM. I'll push and back-patch this after the current minor release is done unless there are more comments related to this work.

--
With Regards,
Amit Kapila.
On Mon, May 9, 2022 at 3:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 6, 2022 at 12:42 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
> >
> > On Fri, May 6, 2022 at 9:54 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > On Wed, May 4, 2022 at 7:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Mon, May 2, 2022 at 8:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > > On Mon, May 2, 2022 at 11:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > >
> > > > > > So, shall we go back to the previous approach of using a separate
> > > > > > function update_replication_progress?
> > > > >
> > > > > Ok, agreed.
> > > > >
> > > >
> > > > Attached, please find the updated patch accordingly. Currently, I have
> > > > prepared it for HEAD; if you don't see any problem with this, we can
> > > > prepare the back-branch patches based on it.
> > >
> > > Thank you for updating the patch. Looks good to me.
> > Thanks for your review.
> >
> > Improved the back-branch patches according to the discussion.
> >

The patches look good to me too.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, May 9, 2022 at 7:01 PM Euler Taveira <euler@eulerto.com> wrote:
>
> On Mon, May 9, 2022, at 3:47 AM, Amit Kapila wrote:
>
> Thanks. The patch LGTM. I'll push and back-patch this after the
> current minor release is done unless there are more comments related
> to this work.
>
> Looks sane to me. (I only tested the HEAD version)
>
> + bool end_xact = ctx->end_xact;
>
> Do you really need a new variable here? It has the same name and the new one
> isn't changed during the execution.
>

I think both ways should be okay. I thought the proposed way is okay because the value is used in more than one place and it is probably slightly easier to follow with a separate variable.

> Does this issue deserve a test? A small wal_receiver_timeout. Although, I'm not
> sure how stable the test will be.
>

Yes, the main part is how to write a stable test, because estimating how many changes are enough for the configured wal_receiver_timeout to pass on all the buildfarm machines is tricky. Also, I am not sure how important it is to test this behavior, because by that theory we should have tests for all kinds of timeouts.

--
With Regards,
Amit Kapila.
On Mon, May 9, 2022 at 11:23 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, May 9, 2022 at 7:01 PM Euler Taveira <euler@eulerto.com> wrote:
> >
> > On Mon, May 9, 2022, at 3:47 AM, Amit Kapila wrote:
> >
> > Thanks. The patch LGTM. I'll push and back-patch this after the
> > current minor release is done unless there are more comments related
> > to this work.
> ......
> > Does this issue deserve a test? A small wal_receiver_timeout. Although, I'm not
> > sure how stable the test will be.
> >
>
> Yes, the main part is how to write a stable test, because estimating
> how many changes are enough for the configured wal_receiver_timeout to
> pass on all the buildfarm machines is tricky. Also, I am not sure how
> important it is to test this behavior, because by that theory we
> should have tests for all kinds of timeouts.

Yes, I think we could not guarantee the stability of this test. In addition, if we set wal_receiver_timeout too small, it may cause timeouts unrelated to this bug. And if we set a bigger wal_receiver_timeout and use a larger transaction in order to minimize the impact of machine performance, I think this might take some time and might risk making the buildfarm slow.

Regards,
Wang wei
On Mon, May 9, 2022 at 2:17 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> The patches look good to me too.
>

Pushed.

--
With Regards,
Amit Kapila.
On Fri, Apr 1, 2022 at 8:28 AM Euler Taveira <euler@eulerto.com> wrote:
>
> On Thu, Mar 31, 2022, at 11:27 PM, Amit Kapila wrote:
>
> This is exactly our initial analysis and we have tried a patch on
> these lines and it has a noticeable overhead. See [1]. Calling this
> for each change or each skipped change can bring noticeable overhead
> that is why we decided to call it after a certain threshold (100) of
> skipped changes. Now, surely as mentioned in my previous reply we can
> make it generic such that instead of calling this (update_progress
> function as in the patch) for skipped cases, we call it always. Will
> that make it better?
>
> That's what I have in mind but using a different approach.
>
> > The functions CreateInitDecodingContext and CreateDecodingContext receive the
> > update_progress function as a parameter. These functions are called in 2
> > places: (a) streaming replication protocol (CREATE_REPLICATION_SLOT) and (b)
> > SQL logical decoding functions (pg_logical_*_changes). Case (a) uses
> > WalSndUpdateProgress as a progress function. Case (b) does not have one because
> > it is not required -- local decoding/communication. There is no custom update
> > progress routine for each output plugin which leads me to the question:
> > couldn't we encapsulate the update progress call into the callback functions?
> >
>
> Sorry, I don't get your point. What exactly do you mean by this?
> AFAIS, currently we call this output plugin API in pgoutput functions
> only, do you intend to get it invoked from a different place?
>
> It seems I didn't make myself clear. The callbacks I'm referring to are the
> *_cb_wrapper functions. After every ctx->callbacks.foo_cb() call into a
> *_cb_wrapper() function, we have something like:
>
> if (ctx->progress & PGOUTPUT_PROGRESS_FOO)
> NewUpdateProgress(ctx, false);
>
> The NewUpdateProgress function would contain a logic similar to the
> update_progress() from the proposed patch. (A different function name here just
> to avoid confusion.)
>
> The output plugin is responsible to set ctx->progress with the callback
> variables (for example, PGOUTPUT_PROGRESS_CHANGE for change_cb()) that we would
> like to run NewUpdateProgress.
>
This sounds like a conflicting approach to what we currently do.
Currently, OutputPluginUpdateProgress() is called from the xact
related pgoutput functions like pgoutput_commit_txn(),
pgoutput_prepare_txn(), pgoutput_commit_prepared_txn(), etc. So, if we
follow what you are saying then for some of the APIs like
pgoutput_change/_message/_truncate, we need to set the parameter to
invoke NewUpdateProgress() which will internally call
OutputPluginUpdateProgress(), and for the remaining APIs, we will call
in the corresponding pgoutput_* function. I feel if we want to make it
more generic than the current patch, it is better to directly call
what you are referring to here as NewUpdateProgress() in all remaining
APIs like pgoutput_change/_truncate, etc.
--
With Regards,
Amit Kapila.
On Tue, Oct 18, 2022 at 22:35 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
> Hello Amit,
>
> In version 14.4 the timeout problem for logical replication happens again despite
> the patch provided for this issue in this version. When bulky materialized views
> are reloaded it broke logical replication. It is possible to solve this problem by
> using your new "streaming" option.
> Have you ever had this issue reported to you?
>
> Regards
>
> Fabrice
>
> 2022-10-10 17:19:02 CEST [538424]: [17-1]
> user=postgres,db=dbxxxa00,client=[local] CONTEXT: SQL statement "REFRESH
> MATERIALIZED VIEW sxxxa00.table_base"
> PL/pgSQL function refresh_materialized_view(text) line 5 at EXECUTE
> 2022-10-10 17:19:02 CEST [538424]: [18-1]
> user=postgres,db=dbxxxa00,client=[local] STATEMENT: select
> refresh_materialized_view('sxxxa00.table_base');
> 2022-10-10 17:19:02 CEST [538424]: [19-1]
> user=postgres,db=dbxxxa00,client=[local] LOG: duration: 264815.652
> ms statement: select refresh_materialized_view('sxxxa00.table_base');
> 2022-10-10 17:19:27 CEST [559156]: [1-1] user=,db=,client= LOG: automatic
> vacuum of table "dbxxxa00.sxxxa00.table_base": index scans: 0
> pages: 0 removed, 296589 remain, 0 skipped due to pins, 0 skipped frozen
> tuples: 0 removed, 48472622 remain, 0 are dead but not yet removable,
> oldest xmin: 1501528
> index scan not needed: 0 pages from table (0.00% of total) had 0 dead item
> identifiers removed
> I/O timings: read: 1.494 ms, write: 0.000 ms
> avg read rate: 0.028 MB/s, avg write rate: 107.952 MB/s
> buffer usage: 593301 hits, 77 misses, 294605 dirtied
> WAL usage: 296644 records, 46119 full page images, 173652718 bytes
> system usage: CPU: user: 17.26 s, system: 0.29 s, elapsed: 21.32 s
> 2022-10-10 17:19:28 CEST [559156]: [2-1] user=,db=,client= LOG: automatic
> analyze of table "dbxxxa00.sxxxa00.table_base"
> I/O timings: read: 0.043 ms, write: 0.000 ms
> avg read rate: 0.026 MB/s, avg write rate: 0.026 MB/s
> buffer usage: 30308 hits, 2 misses, 2 dirtied
> system usage: CPU: user: 0.54 s, system: 0.00 s, elapsed: 0.59 s
> 2022-10-10 17:19:34 CEST [3898111]: [6840-1] user=,db=,client= LOG: checkpoint
> complete: wrote 1194 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
> write=269.551 s, sync=0.002 s, total=269.560 s; sync files=251, longest=0.00
> 1 s, average=0.001 s; distance=583790 kB, estimate=583790 kB
> 2022-10-10 17:20:02 CEST [716163]: [2-1] user=,db=,client= ERROR: terminating
> logical replication worker due to timeout
> 2022-10-10 17:20:02 CEST [3897921]: [13-1] user=,db=,client= LOG: background
> worker "logical replication worker" (PID 716163) exited with exit code 1
> 2022-10-10 17:20:02 CEST [561346]: [1-1] user=,db=,client= LOG: logical
> replication apply worker for subscription "subxxx_sxxxa00" has started
Thanks for reporting!
There is one thing I want to confirm:
Is the statement `select refresh_materialized_view('sxxxa00.table_base');`
executed on the publisher-side?
If so, I think the reason for this timeout problem could be that during DDL
(`REFRESH MATERIALIZED VIEW`), lots of temporary data is generated due to
rewrite. Since these temporary data will not be processed by the pgoutput
plugin, our previous fix for DML had no impact on this case.
I think setting "streaming" option to "on" could work around this problem.
I tried to write a draft patch (see attachment) on REL_14_4 to fix this.
I tried it locally and it seems to work.
Could you please confirm whether this problem is fixed after applying this
draft patch?
If this draft patch works, I will improve it and try to fix this problem.
Regards,
Wang wei
On Thu, Oct 20, 2022 at 13:47 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
> Yes, the refresh of the MV is on the publisher side.
> Thanks for your draft patch, I'll try it.
> I'll get back to you as soon as possible.

Thanks a lot.

> One question: why is the refresh of the MV a DDL and not a DML?

Since in the source the type of the command `REFRESH MATERIALIZED VIEW` is `CMD_UTILITY`, I think this command is DDL (see CmdType in file nodes.h).

BTW, after trying to search for DML in the pg-doc, I found the relevant description at the link below:
https://www.postgresql.org/docs/devel/logical-replication-publication.html

Regards,
Wang wei
On Fri, Nov 4, 2022 at 18:13 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
> Hello Wang,
>
> I tested the draft patch in my lab for Postgres 14.4; the refresh of the
> materialized view ran without generating the timeout on the worker.
> Do you plan to propose this patch at the next commitfest?

Thanks for your confirmation! I will add this thread to the commitfest soon.

The following is the problem analysis and fix approach:

I think the problem is that when there is a DDL in a transaction that generates lots of temporary data due to rewrite rules, this temporary data will not be processed by the pgoutput plugin. Therefore, the previous fix (f95d53e) for DML had no impact on this case.

To fix this, I think we need to try to send the keepalive messages after each change is processed by the walsender, not in the pgoutput plugin.

Attached the patch.

Regards,
Wang wei
Attachments
Hi Wang,

Thanks for working on this. One of our customers faced a similar situation when running BDR with PostgreSQL. I tested your patch and it solves the problem.

Please find some review comments below.

On Tue, Nov 8, 2022 at 8:34 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
>
>
> Attach the patch.
>

+/*
+ * Helper function for ReorderBufferProcessTXN for updating progress.
+ */
+static inline void
+ReorderBufferUpdateProgress(ReorderBuffer *rb, ReorderBufferTXN *txn,
+                            ReorderBufferChange *change)
+{
+    LogicalDecodingContext *ctx = rb->private_data;
+    static int changes_count = 0;

It's not easy to know that a variable is static when reading the code which uses it, so it's easy to interpret the code wrongly. I would probably track it through the logical decoding context itself OR through a global variable, like other places where we track the last timestamps. But there's more below on this.

+
+    if (!ctx->update_progress)
+        return;
+
+    Assert(!ctx->fast_forward);
+
+    /* set output state */
+    ctx->accept_writes = false;
+    ctx->write_xid = txn->xid;
+    ctx->write_location = change->lsn;
+    ctx->end_xact = false;

This patch reverts many of the changes of the previous commit which tried to fix this issue, i.e. 55558df2374. end_xact was introduced by the same commit but without much explanation of that in the commit message. Its only user, WalSndUpdateProgress(), is probably making a wrong assumption as well.

 * We don't have a mechanism to get the ack for any LSN other than end
 * xact LSN from the downstream. So, we track lag only for end of
 * transaction LSN.

IIUC, the WAL sender tracks the LSN of the last WAL record read in sentPtr, which is sent downstream through a keepalive message. Downstream may acknowledge this LSN. So we do get an ack for any LSN, not just the commit LSN. So I propose removing end_xact as well.

+
+    /*
+     * We don't want to try sending a keepalive message after processing each
+     * change as that can have overhead. Tests revealed that there is no
+     * noticeable overhead in doing it after continuously processing 100 or so
+     * changes.
+     */
+#define CHANGES_THRESHOLD 100

I think a time-based threshold makes more sense. What if the timeout was nearing and those 100 changes just took a little more time, causing a timeout? We already have a time-based threshold in WalSndKeepaliveIfNecessary(). And that function is invoked after reading every WAL record in WalSndLoop(). So it does not look like an expensive function. If it is expensive, we might want to worry about WalSndLoop as well. Does it make more sense to remove this threshold?

+
+    /*
+     * After continuously processing CHANGES_THRESHOLD changes, we
+     * try to send a keepalive message if required.
+     */
+    if (++changes_count >= CHANGES_THRESHOLD)
+    {
+        ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false);
+        changes_count = 0;
+    }
+}
+

On the other thread, I mentioned that we don't have a TAP test for it. I agree with Amit's opinion there that it's hard to create a test which will time out everywhere. I think what we need is a way to control the time required for decoding a transaction.

A rough idea is to induce a small sleep after decoding every change. The amount of sleep * number of changes will help us estimate and control the amount of time taken to decode a transaction. Then we create a transaction which will take longer than the timeout threshold to decode. But that's significant code. I don't think PostgreSQL has a facility to induce a delay at a particular place in the code.

--
Best Wishes,
Ashutosh Bapat
On Fri, Jan 6, 2023 at 12:35 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> +
> + /*
> + * We don't want to try sending a keepalive message after processing each
> + * change as that can have overhead. Tests revealed that there is no
> + * noticeable overhead in doing it after continuously processing 100 or so
> + * changes.
> + */
> +#define CHANGES_THRESHOLD 100
>
> I think a time-based threshold makes more sense. What if the timeout was
> nearing and those 100 changes just took a little more time, causing a timeout? We
> already have a time-based threshold in WalSndKeepaliveIfNecessary(). And that
> function is invoked after reading every WAL record in WalSndLoop(). So it does
> not look like it's an expensive function. If it is expensive we might want to
> worry about WalSndLoop as well. Does it make more sense to remove this
> threshold?
>

We have previously tried this for every change [1] and it brought noticeable overhead. In fact, even doing it for every 10 changes also had some overhead, which is why we reached this threshold number. I don't think it can lead to a timeout due to skipping changes, but sure, if we see any such report we can further fine-tune this setting or try to make it time-based; for now, I feel it would be safe to use this threshold.

> +
> + /*
> + * After continuously processing CHANGES_THRESHOLD changes, we
> + * try to send a keepalive message if required.
> + */
> + if (++changes_count >= CHANGES_THRESHOLD)
> + {
> + ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false);
> + changes_count = 0;
> + }
> +}
> +
>
> On the other thread, I mentioned that we don't have a TAP test for it.
> I agree with Amit's opinion there that it's hard to create a test which will
> time out everywhere. I think what we need is a way to control the time required
> for decoding a transaction.
>
> A rough idea is to induce a small sleep after decoding every change. The amount
> of sleep * number of changes will help us estimate and control the amount of
> time taken to decode a transaction. Then we create a transaction which will
> take longer than the timeout threshold to decode. But that's significant code. I
> don't think PostgreSQL has a facility to induce a delay at a particular place
> in the code.
>

Yeah, I don't know how to induce such a delay while decoding changes.

One more thing, I think it would be better to expose a new callback API via reorder buffer as suggested previously [2], similar to other reorder buffer APIs, instead of directly using the reorderbuffer API to invoke the plugin API.

[1] - https://www.postgresql.org/message-id/OS3PR01MB6275DFFDAC7A59FA148931529E209%40OS3PR01MB6275.jpnprd01.prod.outlook.com
[2] - https://www.postgresql.org/message-id/CAA4eK1%2BfQjndoBOFUn9Wy0hhm3MLyUWEpcT9O7iuCELktfdBiQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
On Fri, Jan 6, 2023 at 15:06 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> Hi Wang,
> Thanks for working on this. One of our customers faced a similar
> situation when running BDR with PostgreSQL.
>
> I tested your patch and it solves the problem.
>
> Please find some review comments below.

Thanks for your testing and comments.

> +/*
> + * Helper function for ReorderBufferProcessTXN for updating progress.
> + */
> +static inline void
> +ReorderBufferUpdateProgress(ReorderBuffer *rb, ReorderBufferTXN *txn,
> +                            ReorderBufferChange *change)
> +{
> + LogicalDecodingContext *ctx = rb->private_data;
> + static int changes_count = 0;
>
> It's not easy to know that a variable is static when reading the code which
> uses it, so it's easy to misinterpret the code. I would probably track it
> through the logical decoding context itself, OR through a global variable like
> other places where we track the last timestamps. But there's more below on this.

I'm not sure we need to add global variables or member variables for a cumulative count that is only used here. How would you feel if I added some comments when declaring this static variable?

> +
> + if (!ctx->update_progress)
> + return;
> +
> + Assert(!ctx->fast_forward);
> +
> + /* set output state */
> + ctx->accept_writes = false;
> + ctx->write_xid = txn->xid;
> + ctx->write_location = change->lsn;
> + ctx->end_xact = false;
>
> This patch reverts many of the changes of the previous commit which tried to
> fix this issue, i.e. 55558df2374. end_xact was introduced by the same commit but
> without much explanation in the commit message. Its only user,
> WalSndUpdateProgress(), is probably making a wrong assumption as well.
>
> * We don't have a mechanism to get the ack for any LSN other than end
> * xact LSN from the downstream. So, we track lag only for end of
> * transaction LSN.
>
> IIUC, the WAL sender tracks the LSN of the last WAL record read in sentPtr,
> which is sent downstream through a keepalive message. Downstream may
> acknowledge this LSN. So we do get an ack for any LSN, not just the commit LSN.
>
> So I propose removing end_xact as well.

We didn't track the lag during a transaction because it could make the lag calculations inaccurate. If we tracked every LSN, important LSN information could fail to be recorded because of WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS (see function WalSndUpdateProgress). Please see the details in [1] and [2].

Regards,
Wang Wei

[1] - https://www.postgresql.org/message-id/OS3PR01MB62755D216245199554DDC8DB9EEA9%40OS3PR01MB6275.jpnprd01.prod.outlook.com
[2] - https://www.postgresql.org/message-id/OS3PR01MB627514AE0B3040D8F55A68B99EEA9%40OS3PR01MB6275.jpnprd01.prod.outlook.com
On Mon, Jan 9, 2023 at 13:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>

Thanks for your comments.

> One more thing, I think it would be better to expose a new callback
> API via reorder buffer as suggested previously [2] similar to other
> reorder buffer APIs instead of directly using reorderbuffer API to
> invoke plugin API.

Yes, I agree. I think it would be better to add a new callback API on HEAD. So, I improved the fix approach: introduce a new optional callback to update the progress. This callback function is invoked at the end of the main loop in the function ReorderBufferProcessTXN() for each change. In this way, similar timeout problems could be avoided.

BTW, I did a performance test for this patch. When running the SQL that reproduces the problem (refreshing the materialized view in synchronous logical replication mode), the running time of the new function pgoutput_update_progress is less than 0.1% of the total time. I think this result looks OK.

Attach the new patch.

Regards,
Wang Wei
Attachments
On Mon, Jan 9, 2023 at 4:08 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
>
> On Fri, Jan 6, 2023 at 15:06 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> I'm not sure if we need to add global variables or member variables for a
> cumulative count that is only used here. How would you feel if I add some
> comments when declaring this static variable?

I see WalSndUpdateProgress::sendTime is static already, so this seems fine. A comment will help, sure.

>
> > +
> > + if (!ctx->update_progress)
> > + return;
> > +
> > + Assert(!ctx->fast_forward);
> > +
> > + /* set output state */
> > + ctx->accept_writes = false;
> > + ctx->write_xid = txn->xid;
> > + ctx->write_location = change->lsn;
> > + ctx->end_xact = false;
> >
> > This patch reverts many of the changes of the previous commit which tried to
> > fix this issue i.e. 55558df2374. end_xact was introduced by the same commit but
> > without much explanation of that in the commit message. Its only user,
> > WalSndUpdateProgress(), is probably making a wrong assumption as well.
> >
> > * We don't have a mechanism to get the ack for any LSN other than end
> > * xact LSN from the downstream. So, we track lag only for end of
> > * transaction LSN.
> >
> > IIUC, WAL sender tracks the LSN of the last WAL record read in sentPtr which is
> > sent downstream through a keepalive message. Downstream may acknowledge this
> > LSN. So we do get ack for any LSN, not just commit LSN.
> >
> > So I propose removing end_xact as well.
>
> We didn't track the lag during a transaction because it could make the
> calculations of lag functionality inaccurate. If we track every lsn, it could
> fail to record important lsn information because of
> WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS (see function WalSndUpdateProgress).
> Please see details in [1] and [2].

LagTrackerRead() interpolates to reduce the inaccuracy. I don't understand why we need to track the end LSN only. But I don't think that affects this fix. So I am fine if we want to leave end_xact there.

-- 
Best Wishes,
Ashutosh Bapat
On Wed, Jan 11, 2023 at 4:11 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
>
> On Mon, Jan 9, 2023 at 13:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> Thanks for your comments.
>
> > One more thing, I think it would be better to expose a new callback
> > API via reorder buffer as suggested previously [2] similar to other
> > reorder buffer APIs instead of directly using reorderbuffer API to
> > invoke plugin API.
>
> Yes, I agree. I think it would be better to add a new callback API on HEAD.
> So, I improved the fix approach:
> Introduce a new optional callback to update the progress. This callback function
> is invoked at the end inside the main loop of the function
> ReorderBufferProcessTXN() for each change. In this way, I think that
> similar timeout problems could be avoided.

I am a bit worried about the indirections that the wrappers and hooks create. Output plugins call OutputPluginUpdateProgress() in callbacks, but I don't see why ReorderBufferProcessTXN() needs a callback to call OutputPluginUpdateProgress. I don't think output plugins are going to do anything special with that callback other than just call OutputPluginUpdateProgress. Every output plugin will need to implement it, and if they do not they will face the timeout problem. That would be unnecessary.

Instead, ReorderBufferUpdateProgress() in your first patch was more direct and readable. That way the fix works for any output plugin.

In fact, I am wondering whether we could have a call in ReorderBufferProcessTxn() at the end of transaction (commit/prepare/commit prepared/abort prepared) instead of the corresponding output plugin callbacks calling OutputPluginUpdateProgress().

-- 
Best Wishes,
Ashutosh Bapat
On Mon, Jan 16, 2023 at 10:06 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Wed, Jan 11, 2023 at 4:11 PM wangw.fnst@fujitsu.com > <wangw.fnst@fujitsu.com> wrote: > > > > On Mon, Jan 9, 2023 at 13:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > Thanks for your comments. > > > > > One more thing, I think it would be better to expose a new callback > > > API via reorder buffer as suggested previously [2] similar to other > > > reorder buffer APIs instead of directly using reorderbuffer API to > > > invoke plugin API. > > > > Yes, I agree. I think it would be better to add a new callback API on the HEAD. > > So, I improved the fix approach: > > Introduce a new optional callback to update the process. This callback function > > is invoked at the end inside the main loop of the function > > ReorderBufferProcessTXN() for each change. In this way, I think it seems that > > similar timeout problems could be avoided. > > I am a bit worried about the indirections that the wrappers and hooks > create. Output plugins call OutputPluginUpdateProgress() in callbacks > but I don't see why ReorderBufferProcessTXN() needs a callback to > call OutputPluginUpdateProgress. > Yeah, I think we can do it as we are doing the previous approach but we need an additional wrapper (update_progress_cb_wrapper()) as the current patch has so that we can add error context information. This is similar to why we have a wrapper for all other callbacks like change_cb_wrapper. -- With Regards, Amit Kapila.
On Tue, Jan 17, 2023 at 3:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > I am a bit worried about the indirections that the wrappers and hooks
> > create. Output plugins call OutputPluginUpdateProgress() in callbacks
> > but I don't see why ReorderBufferProcessTXN() needs a callback to
> > call OutputPluginUpdateProgress.
> >
>
> Yeah, I think we can do it as we are doing the previous approach but
> we need an additional wrapper (update_progress_cb_wrapper()) as the
> current patch has so that we can add error context information. This
> is similar to why we have a wrapper for all other callbacks like
> change_cb_wrapper.
>

Ultimately OutputPluginUpdateProgress() will be called, which in turn will call ctx->update_progress. I don't see wrappers around OutputPluginWrite or OutputPluginPrepareWrite. But I see that those two are always called from the output plugin, so indirectly they are called through a wrapper.

I also see that update_progress_cb_wrapper() is similar, as far as the wrapper is concerned, to ReorderBufferUpdateProgress() in the earlier patch. ReorderBufferUpdateProgress() looks more readable than the wrapper.

If we want to keep the wrapper, at least we should use a different variable name. update_progress is also there in LogicalDecodingContext and will be indirectly called from ReorderBuffer::update_progress. Somebody might think that there's some recursion involved there. That's a mighty confusion.

-- 
Best Wishes,
Ashutosh Bapat
On Tue, Jan 17, 2023 at 6:41 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Tue, Jan 17, 2023 at 3:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > I am a bit worried about the indirections that the wrappers and hooks > > > create. Output plugins call OutputPluginUpdateProgress() in callbacks > > > but I don't see why ReorderBufferProcessTXN() needs a callback to > > > call OutputPluginUpdateProgress. > > > > > > > Yeah, I think we can do it as we are doing the previous approach but > > we need an additional wrapper (update_progress_cb_wrapper()) as the > > current patch has so that we can add error context information. This > > is similar to why we have a wrapper for all other callbacks like > > change_cb_wrapper. > > > > Ultimately OutputPluginUpdateProgress() will be called - which in turn > will call ctx->update_progress. > No, update_progress_cb_wrapper() should directly call ctx->update_progress(). The key reason to have a update_progress_cb_wrapper() is that it allows us to add error context information (see the usage of output_plugin_error_callback). -- With Regards, Amit Kapila.
On Wed, Jan 18, 2023 at 13:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Jan 17, 2023 at 6:41 PM Ashutosh Bapat > <ashutosh.bapat.oss@gmail.com> wrote: > > > > On Tue, Jan 17, 2023 at 3:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > I am a bit worried about the indirections that the wrappers and hooks > > > > create. Output plugins call OutputPluginUpdateProgress() in callbacks > > > > but I don't see why ReorderBufferProcessTXN() needs a callback to > > > > call OutputPluginUpdateProgress. > > > > > > > > > > Yeah, I think we can do it as we are doing the previous approach but > > > we need an additional wrapper (update_progress_cb_wrapper()) as the > > > current patch has so that we can add error context information. This > > > is similar to why we have a wrapper for all other callbacks like > > > change_cb_wrapper. > > > > > > > Ultimately OutputPluginUpdateProgress() will be called - which in turn > > will call ctx->update_progress. > > > > No, update_progress_cb_wrapper() should directly call > ctx->update_progress(). The key reason to have a > update_progress_cb_wrapper() is that it allows us to add error context > information (see the usage of output_plugin_error_callback). I think it makes sense. This also avoids the need for every output plugin to implement the callback. So I tried to improve the patch based on this approach. And I tried to add some comments for this new callback to distinguish it from ctx->update_progress. Attach the new patch. Regards, Wang Wei
Attachments
On Wed, Jan 18, 2023 at 1:49 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > On Wed, Jan 18, 2023 at 13:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 17, 2023 at 6:41 PM Ashutosh Bapat > > <ashutosh.bapat.oss@gmail.com> wrote: > > > > > > On Tue, Jan 17, 2023 at 3:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > I am a bit worried about the indirections that the wrappers and hooks > > > > > create. Output plugins call OutputPluginUpdateProgress() in callbacks > > > > > but I don't see why ReorderBufferProcessTXN() needs a callback to > > > > > call OutputPluginUpdateProgress. > > > > > > > > > > > > > Yeah, I think we can do it as we are doing the previous approach but > > > > we need an additional wrapper (update_progress_cb_wrapper()) as the > > > > current patch has so that we can add error context information. This > > > > is similar to why we have a wrapper for all other callbacks like > > > > change_cb_wrapper. > > > > > > > > > > Ultimately OutputPluginUpdateProgress() will be called - which in turn > > > will call ctx->update_progress. > > > > > > > No, update_progress_cb_wrapper() should directly call > > ctx->update_progress(). The key reason to have a > > update_progress_cb_wrapper() is that it allows us to add error context > > information (see the usage of output_plugin_error_callback). > > I think it makes sense. This also avoids the need for every output plugin to > implement the callback. So I tried to improve the patch based on this approach. > > And I tried to add some comments for this new callback to distinguish it from > ctx->update_progress. Comments don't help when using cscope or some such code browsing tool. Better to use a different variable name. -- Best Wishes, Ashutosh Bapat
On Wed, Jan 18, 2023 at 5:37 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Wed, Jan 18, 2023 at 1:49 PM wangw.fnst@fujitsu.com > <wangw.fnst@fujitsu.com> wrote: > > > > On Wed, Jan 18, 2023 at 13:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Tue, Jan 17, 2023 at 6:41 PM Ashutosh Bapat > > > <ashutosh.bapat.oss@gmail.com> wrote: > > > > > > > > On Tue, Jan 17, 2023 at 3:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > > I am a bit worried about the indirections that the wrappers and hooks > > > > > > create. Output plugins call OutputPluginUpdateProgress() in callbacks > > > > > > but I don't see why ReorderBufferProcessTXN() needs a callback to > > > > > > call OutputPluginUpdateProgress. > > > > > > > > > > > > > > > > Yeah, I think we can do it as we are doing the previous approach but > > > > > we need an additional wrapper (update_progress_cb_wrapper()) as the > > > > > current patch has so that we can add error context information. This > > > > > is similar to why we have a wrapper for all other callbacks like > > > > > change_cb_wrapper. > > > > > > > > > > > > > Ultimately OutputPluginUpdateProgress() will be called - which in turn > > > > will call ctx->update_progress. > > > > > > > > > > No, update_progress_cb_wrapper() should directly call > > > ctx->update_progress(). The key reason to have a > > > update_progress_cb_wrapper() is that it allows us to add error context > > > information (see the usage of output_plugin_error_callback). > > > > I think it makes sense. This also avoids the need for every output plugin to > > implement the callback. So I tried to improve the patch based on this approach. > > > > And I tried to add some comments for this new callback to distinguish it from > > ctx->update_progress. > > Comments don't help when using cscope or some such code browsing tool. > Better to use a different variable name. 
>

+ /*
+ * Callback to be called when updating progress during sending data of a
+ * transaction (and its subtransactions) to the output plugin.
+ */
+ ReorderBufferUpdateProgressCB update_progress;

Are you suggesting changing the name of the above variable? If so, how about apply_progress, progress, or updateprogress? If you don't like any of these then feel free to suggest something else. If we change the variable name then accordingly, we need to update ReorderBufferUpdateProgressCB as well.

-- 
With Regards,
Amit Kapila.
On Wed, Jan 18, 2023 at 6:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> + */
> + ReorderBufferUpdateProgressCB update_progress;
>
> Are you suggesting changing the name of the above variable? If so, how
> about apply_progress, progress, or updateprogress? If you don't like
> any of these then feel free to suggest something else. If we change
> the variable name then accordingly, we need to update
> ReorderBufferUpdateProgressCB as well.
>

I would have liked all the callback names renamed with the prefix "rbcb_xxx" so that they have much less chance of conflicting with similar names in the code base. But it's probably too late to do that :).

How about update_txn_progress, since the CB is supposed to be used only within a transaction? Or update_progress_txn?

update_progress_cb_wrapper needs a change of name as well.

-- 
Best Wishes,
Ashutosh Bapat
On Thu, Jan 19, 2023 at 4:13 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Wed, Jan 18, 2023 at 6:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > + */ > > + ReorderBufferUpdateProgressCB update_progress; > > > > Are you suggesting changing the name of the above variable? If so, how > > about apply_progress, progress, or updateprogress? If you don't like > > any of these then feel free to suggest something else. If we change > > the variable name then accordingly, we need to update > > ReorderBufferUpdateProgressCB as well. > > > > I would liked to have all the callback names renamed with prefix > "rbcb_xxx" so that they have very less chances of conflicting with > similar names in the code base. But it's probably late to do that :). > > How are update_txn_progress since the CB is supposed to be used only > within a transaction? or update_progress_txn? > Personally, I would prefer 'apply_progress' as it would be similar to a few other callbacks like apply_change, apply_truncate, or as is proposed by patch update_progress again because it is similar to existing callbacks like commit_prepared. If you and others don't like any of those then we can go for 'update_progress_txn' as well. Anybody else has an opinion on this? -- With Regards, Amit Kapila.
Here are some review comments for patch v3-0001.

======
Commit message

1.
The problem is when there is a DDL in a transaction that generates lots of
temporary data due to rewrite rules, these temporary data will not be processed
by the pgoutput - plugin. Therefore, the previous fix (f95d53e) for DML had no
impact on this case.

~

1a.
IMO this comment needs to give a bit of background about the original problem here, rather than just starting with "The problem is" which is describing the flaws of the previous fix.

~

1b.
"pgoutput - plugin" -> "pgoutput plugin" ??

~~~

2.
To fix this, we introduced a new ReorderBuffer callback -
'ReorderBufferUpdateProgressCB'. This callback is called to try to update the
process after each change has been processed during sending data of a
transaction (and its subtransactions) to the output plugin.

IIUC it's not really "after each change" - shouldn't this comment mention something about the CHANGES_THRESHOLD 100?

======
src/backend/replication/logical/logical.c

3. forward declaration

+/* update progress callback */
+static void update_progress_cb_wrapper(ReorderBuffer *cache,
+                                       ReorderBufferTXN *txn,
+                                       ReorderBufferChange *change);

I felt this function wrapper name was a bit misleading... AFAIK every other wrapper really does just wrap their respective functions. But this one seems a bit different because it calls the wrapped function ONLY if some threshold is exceeded. IMO maybe this function could have some name that conveys this better:

e.g. update_progress_cb_wrapper_with_threshold

~~~

4. update_progress_cb_wrapper

+/*
+ * Update progress callback
+ *
+ * Try to update progress and send a keepalive message if too many changes were
+ * processed when processing txn.
+ *
+ * For a large transaction, if we don't send any change to the downstream for a
+ * long time (exceeds the wal_receiver_timeout of standby) then it can timeout.
+ * This can happen when all or most of the changes are either not published or
+ * got filtered out.
+ */

SUGGESTION (instead of the "Try to update" sentence)
Send a keepalive message whenever more than <CHANGES_THRESHOLD> changes are encountered while processing a transaction.

~~~

5.

+static void
+update_progress_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+                           ReorderBufferChange *change)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ static int changes_count = 0; /* Static variable used to accumulate
+                                * the number of changes while
+                                * processing txn. */
+

IMO this may be more readable if the static 'changes_count' local var was declared first and separated from the other vars by a blank line.

~~~

6.

+ /*
+ * We don't want to try sending a keepalive message after processing each
+ * change as that can have overhead. Tests revealed that there is no
+ * noticeable overhead in doing it after continuously processing 100 or so
+ * changes.
+ */
+#define CHANGES_THRESHOLD 100

6a.
I think it might be better to define this right at the top of the function adjacent to the 'changes_count' variable (e.g. a bit like the original HEAD code looked)

~

6b.
SUGGESTION (for the comment)
Sending keepalive messages after every change has some overhead, but testing showed there is no noticeable overhead if keepalive is only sent after every ~100 changes.

~~~

7.

+
+ /*
+ * After continuously processing CHANGES_THRESHOLD changes, we
+ * try to send a keepalive message if required.
+ */
+ if (++changes_count >= CHANGES_THRESHOLD)
+ {
+ ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false);
+ changes_count = 0;
+ }
+

7a.
SUGGESTION (for comment)
Send a keepalive message after every CHANGES_THRESHOLD changes.

~

7b.
Would it be neater to just call OutputPluginUpdateProgress here instead?

e.g.
BEFORE
ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false);
AFTER
OutputPluginUpdateProgress(ctx, false);

------
Kind Regards,
Peter Smith.
Fujitsu Australia
On Fri, Jan 20, 2023 at 7:40 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for patch v3-0001. > > ====== > src/backend/replication/logical/logical.c > > 3. forward declaration > > +/* update progress callback */ > +static void update_progress_cb_wrapper(ReorderBuffer *cache, > + ReorderBufferTXN *txn, > + ReorderBufferChange *change); > > I felt this function wrapper name was a bit misleading... AFAIK every > other wrapper really does just wrap their respective functions. But > this one seems a bit different because it calls the wrapped function > ONLY if some threshold is exceeded. IMO maybe this function could have > some name that conveys this better: > > e.g. update_progress_cb_wrapper_with_threshold > I am wondering whether it would be better to move the threshold logic to the caller. Previously this logic was inside the function because it was being invoked from multiple places but now that won't be the case. Also, then your concern about the name would also be addressed. > > ~ > > 7b. > Would it be neater to just call OutputPluginUpdateProgress here instead? > > e.g. > BEFORE > ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false); > AFTER > OutputPluginUpdateProgress(ctx, false); > We already check whether ctx->update_progress is defined or not which is the only extra job done by OutputPluginUpdateProgress but probably we can consolidate the checks and directly invoke OutputPluginUpdateProgress. -- With Regards, Amit Kapila.
On Fri, Jan 20, 2023 at 3:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jan 20, 2023 at 7:40 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Here are some review comments for patch v3-0001. > > > > ====== > > src/backend/replication/logical/logical.c > > > > 3. forward declaration > > > > +/* update progress callback */ > > +static void update_progress_cb_wrapper(ReorderBuffer *cache, > > + ReorderBufferTXN *txn, > > + ReorderBufferChange *change); > > > > I felt this function wrapper name was a bit misleading... AFAIK every > > other wrapper really does just wrap their respective functions. But > > this one seems a bit different because it calls the wrapped function > > ONLY if some threshold is exceeded. IMO maybe this function could have > > some name that conveys this better: > > > > e.g. update_progress_cb_wrapper_with_threshold > > > > I am wondering whether it would be better to move the threshold logic > to the caller. Previously this logic was inside the function because > it was being invoked from multiple places but now that won't be the > case. Also, then your concern about the name would also be addressed. > > > > > ~ > > > > 7b. > > Would it be neater to just call OutputPluginUpdateProgress here instead? > > > > e.g. > > BEFORE > > ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false); > > AFTER > > OutputPluginUpdateProgress(ctx, false); > > > > We already check whether ctx->update_progress is defined or not which > is the only extra job done by OutputPluginUpdateProgress but probably > we can consolidate the checks and directly invoke > OutputPluginUpdateProgress. > Yes, I saw that, but I thought it was better to keep the early exit from update_progress_cb_wrapper, so incurring just one additional boolean check for every 100 changes was not anything to worry about. ------ Kind Regards, Peter Smith. Fujitsu Australia.
On Thu, Jan 19, 2023 at 19:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jan 19, 2023 at 4:13 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > On Wed, Jan 18, 2023 at 6:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > + */
> > > + ReorderBufferUpdateProgressCB update_progress;
> > >
> > > Are you suggesting changing the name of the above variable? If so, how
> > > about apply_progress, progress, or updateprogress? If you don't like
> > > any of these then feel free to suggest something else. If we change
> > > the variable name then accordingly, we need to update
> > > ReorderBufferUpdateProgressCB as well.
> > >
> >
> > I would have liked all the callback names renamed with the prefix
> > "rbcb_xxx" so that they have much less chance of conflicting with
> > similar names in the code base. But it's probably too late to do that :).
> >
> > How about update_txn_progress, since the CB is supposed to be used only
> > within a transaction? Or update_progress_txn?
> >
>
> Personally, I would prefer 'apply_progress' as it would be similar to
> a few other callbacks like apply_change, apply_truncate, or as is
> proposed by patch update_progress again because it is similar to
> existing callbacks like commit_prepared. If you and others don't like
> any of those then we can go for 'update_progress_txn' as well. Anybody
> else has an opinion on this?

I think 'update_progress_txn' might be better, because this name makes it easier to see that this callback is used to update progress when processing a txn. So, I renamed it to 'update_progress_txn'.

I have addressed all the comments and here is the new version patch.

Regards,
Wang Wei
Attachments
On Fri, Jan 20, 2023 at 12:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Jan 20, 2023 at 7:40 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Here are some review comments for patch v3-0001. > > > > ====== > > src/backend/replication/logical/logical.c > > > > 3. forward declaration > > > > +/* update progress callback */ > > +static void update_progress_cb_wrapper(ReorderBuffer *cache, > > + ReorderBufferTXN *txn, > > + ReorderBufferChange *change); > > > > I felt this function wrapper name was a bit misleading... AFAIK every > > other wrapper really does just wrap their respective functions. But > > this one seems a bit different because it calls the wrapped function > > ONLY if some threshold is exceeded. IMO maybe this function could have > > some name that conveys this better: > > > > e.g. update_progress_cb_wrapper_with_threshold > > > > I am wondering whether it would be better to move the threshold logic > to the caller. Previously this logic was inside the function because > it was being invoked from multiple places but now that won't be the > case. Also, then your concern about the name would also be addressed. Agree. Moved the threshold logic to the function ReorderBufferProcessTXN. > > > > ~ > > > > 7b. > > Would it be neater to just call OutputPluginUpdateProgress here instead? > > > > e.g. > > BEFORE > > ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false); > > AFTER > > OutputPluginUpdateProgress(ctx, false); > > > > We already check whether ctx->update_progress is defined or not which > is the only extra job done by OutputPluginUpdateProgress but probably > we can consolidate the checks and directly invoke > OutputPluginUpdateProgress. Changed. Invoke the function OutputPluginUpdateProgress directly in the new callback. Regards, Wang Wei
On Fri, Jan 20, 2023 at 10:10 AM Peter Smith <smithpb2250@gmail.com> wrote: > Here are some review comments for patch v3-0001. Thanks for your comments. > ====== > Commit message > > 1. > The problem is when there is a DDL in a transaction that generates lots of > temporary data due to rewrite rules, these temporary data will not be > processed > by the pgoutput - plugin. Therefore, the previous fix (f95d53e) for DML had no > impact on this case. > > ~ > > 1a. > IMO this comment needs to give a bit of background about the original > problem here, rather than just starting with "The problem is" which is > describing the flaws of the previous fix. Added some related message. > ~ > > 1b. > "pgoutput - plugin" -> "pgoutput plugin" ?? Changed. > ~~~ > > 2. > > To fix this, we introduced a new ReorderBuffer callback - > 'ReorderBufferUpdateProgressCB'. This callback is called to try to update the > process after each change has been processed during sending data of a > transaction (and its subtransactions) to the output plugin. > > IIUC it's not really "after each change" - shouldn't this comment > mention something about the CHANGES_THRESHOLD 100? Changed. > ~~~ > > 4. update_progress_cb_wrapper > > +/* > + * Update progress callback > + * > + * Try to update progress and send a keepalive message if too many changes > were > + * processed when processing txn. > + * > + * For a large transaction, if we don't send any change to the downstream for a > + * long time (exceeds the wal_receiver_timeout of standby) then it can > timeout. > + * This can happen when all or most of the changes are either not published or > + * got filtered out. > + */ > > SUGGESTION (instead of the "Try to update" sentence) > Send a keepalive message whenever more than <CHANGES_THRESHOLD> > changes are encountered while processing a transaction. 
Since it's possible that keep-alive messages won't be sent even if the threshold is reached (see function WalSndKeepaliveIfNecessary), I thought it might be better to use "try to". And rewrote the comments here because the threshold logic is moved to the function ReorderBufferProcessTXN. > ~~~ > > 5. > > +static void > +update_progress_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, > + ReorderBufferChange *change) > +{ > + LogicalDecodingContext *ctx = cache->private_data; > + LogicalErrorCallbackState state; > + ErrorContextCallback errcallback; > + static int changes_count = 0; /* Static variable used to accumulate > + * the number of changes while > + * processing txn. */ > + > > IMO this may be more readable if the static 'changes_count' local var > was declared first and separated from the other vars by a blank line. Changed. > ~~~ > > 6. > > + /* > + * We don't want to try sending a keepalive message after processing each > + * change as that can have overhead. Tests revealed that there is no > + * noticeable overhead in doing it after continuously processing 100 or so > + * changes. > + */ > +#define CHANGES_THRESHOLD 100 > > 6a. > I think it might be better to define this right at the top of the > function adjacent to the 'changes_count' variable (e.g. a bit like the > original HEAD code looked) Changed. > ~ > > 6b. > SUGGESTION (for the comment) > Sending keepalive messages after every change has some overhead, but > testing showed there is no noticeable overhead if keepalive is only > sent after every ~100 changes. Changed. > ~~~ > > 7. > > + > + /* > + * After continuously processing CHANGES_THRESHOLD changes, we > + * try to send a keepalive message if required. > + */ > + if (++changes_count >= CHANGES_THRESHOLD) > + { > + ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false); > + changes_count = 0; > + } > + > > 7a. > SUGGESTION (for comment) > Send a keepalive message after every CHANGES_THRESHOLD changes. Changed. 
Regards, Wang Wei
Here are my review comments for patch v4-0001 ====== General 1. It makes no real difference, but I was wondering about: "update txn progress" versus "update progress txn" I thought that the first way sounds more natural. YMMV. If you change this then there is an impact on the typedef, function names, comments, member names: ReorderBufferUpdateTxnProgressCB --> ReorderBufferUpdateProgressTxnCB “/* update progress txn callback */” --> “/* update txn progress callback */” update_progress_txn_cb_wrapper --> update_txn_progress_cb_wrapper updated_progress_txn --> update_txn_progress ====== Commit message 2. The problem is when there is a DDL in a transaction that generates lots of temporary data due to rewrite rules, these temporary data will not be processed by the pgoutput plugin. The previous commit (f95d53e) only fixed timeouts caused by filtering out changes in pgoutput. Therefore, the previous fix for DML had no impact on this case. ~ IMO this still needs some rewording to say up-front what the actual problem is -- i.e. an avoidable timeout occurring. SUGGESTION (or something like this...) When there is a DDL in a transaction that generates lots of temporary data due to rewrite rules, this temporary data will not be processed by the pgoutput plugin. This means it is possible for a timeout to occur if a sufficiently long time elapses since the last pgoutput message. A previous commit (f95d53e) fixed a similar scenario in this area, but that only fixed timeouts for DML going through pgoutput, so it did not address this DDL timeout case. ====== src/backend/replication/logical/logical.c 3. update_progress_txn_cb_wrapper +/* + * Update progress callback while processing a transaction. + * + * Try to update progress and send a keepalive message during sending data of a + * transaction (and its subtransactions) to the output plugin. 
+ * + * For a large transaction, if we don't send any change to the downstream for a + * long time (exceeds the wal_receiver_timeout of standby) then it can timeout. + * This can happen when all or most of the changes are either not published or + * got filtered out. + */ +static void +update_progress_txn_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, + ReorderBufferChange *change) Simplify the "Try to..." paragraph. And the other part should also mention DDL. SUGGESTION Try send a keepalive message during transaction processing. This is done because if we don't send any change to the downstream for a long time (exceeds the wal_receiver_timeout of standby), then it can timeout. This can happen for large DDL, or for large transactions when all or most of the changes are either not published or got filtered out. ====== .../replication/logical/reorderbuffer.c 4. ReorderBufferProcessTXN @@ -2105,6 +2105,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, PG_TRY(); { + /* + * Static variable used to accumulate the number of changes while + * processing txn. + */ + static int changes_count = 0; + + /* + * Sending keepalive messages after every change has some overhead, but + * testing showed there is no noticeable overhead if keepalive is only + * sent after every ~100 changes. + */ +#define CHANGES_THRESHOLD 100 + IMO these can be relocated to be declared/defined inside the "while" loop -- i.e. closer to where they are being used. ~~~ 5. + if (++changes_count >= CHANGES_THRESHOLD) + { + rb->update_progress_txn(rb, txn, change); + changes_count = 0; + } When there is no update_progress function this code is still incurring some small additional overhead for incrementing and testing the THRESHOLD every time, and also needlessly calling the wrapper every 100x. This overhead could be avoided with a simpler up-front check like shown below. OTOH, maybe the overhead is insignificant enough that just leaving the current code is neater? 
LogicalDecodingContext *ctx = rb->private_data; ... if (ctx->update_progress_txn && (++changes_count >= CHANGES_THRESHOLD)) { rb->update_progress_txn(rb, txn, change); changes_count = 0; } ------ Kind Regards, Peter Smith. Fujitsu Australia
On Mon, Jan 23, 2023 at 6:21 AM Peter Smith <smithpb2250@gmail.com> wrote: > > 1. > > It makes no real difference, but I was wondering about: > "update txn progress" versus "update progress txn" > Yeah, I think we can go either way but I still prefer "update progress txn" as that is closer to the LogicalOutputPluginWriterUpdateProgress callback name. > > 5. > > + if (++changes_count >= CHANGES_THRESHOLD) > + { > + rb->update_progress_txn(rb, txn, change); > + changes_count = 0; > + } > > When there is no update_progress function this code is still incurring > some small additional overhead for incrementing and testing the > THRESHOLD every time, and also needlessly calling to the wrapper every > 100x. This overhead could be avoided with a simpler up-front check > like shown below. OTOH, maybe the overhead is insignificant enough > that just leaving the curent code is neater? > As far as built-in logical replication is concerned, it will be defined and I don't know if the overhead will be significant enough in this case. Also, one can say that for the cases it is defined, we are adding this check multiple times (it is already checked inside OutputPluginUpdateProgress). So, I would prefer the neater code here. -- With Regards, Amit Kapila.
On Monday, January 23, 2023 8:51 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are my review comments for patch v4-0001 > ====== > Commit message > > 2. > > The problem is when there is a DDL in a transaction that generates lots of > temporary data due to rewrite rules, these temporary data will not be processed > by the pgoutput plugin. The previous commit (f95d53e) only fixed timeouts > caused by filtering out changes in pgoutput. Therefore, the previous fix for DML > had no impact on this case. > > ~ > > IMO this still some rewording to say up-front what the the actual problem -- i.e. > an avoidable timeout occuring. > > SUGGESTION (or something like this...) > > When there is a DDL in a transaction that generates lots of temporary data due > to rewrite rules, this temporary data will not be processed by the pgoutput > plugin. This means it is possible for a timeout to occur if a sufficiently long time > elapses since the last pgoutput message. A previous commit (f95d53e) fixed a > similar scenario in this area, but that only fixed timeouts for DML going through > pgoutput, so it did not address this DDL timeout case. Thanks, I changed the commit message as suggested. > ====== > src/backend/replication/logical/logical.c > > 3. update_progress_txn_cb_wrapper > > +/* > + * Update progress callback while processing a transaction. > + * > + * Try to update progress and send a keepalive message during sending > +data of a > + * transaction (and its subtransactions) to the output plugin. > + * > + * For a large transaction, if we don't send any change to the > +downstream for a > + * long time (exceeds the wal_receiver_timeout of standby) then it can timeout. > + * This can happen when all or most of the changes are either not > +published or > + * got filtered out. > + */ > +static void > +update_progress_txn_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN > *txn, > + ReorderBufferChange *change) > > Simplify the "Try to..." paragraph. 
And other part should also mention about DDL. > > SUGGESTION > > Try send a keepalive message during transaction processing. > > This is done because if we don't send any change to the downstream for a long > time (exceeds the wal_receiver_timeout of standby), then it can timeout. This can > happen for large DDL, or for large transactions when all or most of the changes > are either not published or got filtered out. Changed. > ====== > .../replication/logical/reorderbuffer.c > > 4. ReorderBufferProcessTXN > > @@ -2105,6 +2105,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > > PG_TRY(); > { > + /* > + * Static variable used to accumulate the number of changes while > + * processing txn. > + */ > + static int changes_count = 0; > + > + /* > + * Sending keepalive messages after every change has some overhead, but > + * testing showed there is no noticeable overhead if keepalive is only > + * sent after every ~100 changes. > + */ > +#define CHANGES_THRESHOLD 100 > + > > IMO these can be relocated to be declared/defined inside the "while" > loop -- i.e. closer to where they are being used. Moved into the while loop. Attach the new version patch which addresses the above comments. Also attach a simple script which uses "refresh matview" to reproduce this timeout problem, just in case someone wants to try to reproduce this. Best regards, Hou zj
Attachments
Hi Hou-san, Here are my review comments for v5-0001. ====== src/backend/replication/logical/reorderbuffer.c 1. @@ -2446,6 +2452,23 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, elog(ERROR, "tuplecid value in changequeue"); break; } + + /* + * Sending keepalive messages after every change has some overhead, but + * testing showed there is no noticeable overhead if keepalive is only + * sent after every ~100 changes. + */ +#define CHANGES_THRESHOLD 100 + + /* + * Try to send a keepalive message after every CHANGES_THRESHOLD + * changes. + */ + if (++changes_count >= CHANGES_THRESHOLD) + { + rb->update_progress_txn(rb, txn, change); + changes_count = 0; + } I noticed you put the #define adjacent to the only usage of it, instead of with the other variable declaration like it was before. Probably it is better how you have done it, but: 1a. The comment indentation is incorrect. ~ 1b. Since the #define is adjacent to its only usage IMO now the 2nd comment is redundant. So the code can just say /* * Sending keepalive messages after every change has some overhead, but * testing showed there is no noticeable overhead if keepalive is only * sent after every ~100 changes. */ #define CHANGES_THRESHOLD 100 if (++changes_count >= CHANGES_THRESHOLD) { rb->update_progress_txn(rb, txn, change); changes_count = 0; } ------ Kind Regards, Peter Smith. Fujitsu Australia
On Tues, Jan 24, 2023 at 8:28 AM Peter Smith <smithpb2250@gmail.com> wrote: > Hi Hou-san, Here are my review comments for v5-0001. Thanks for your comments. > ====== > src/backend/replication/logical/reorderbuffer.c > > 1. > @@ -2446,6 +2452,23 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > elog(ERROR, "tuplecid value in changequeue"); > break; > } > + > + /* > + * Sending keepalive messages after every change has some overhead, but > + * testing showed there is no noticeable overhead if keepalive is only > + * sent after every ~100 changes. > + */ > +#define CHANGES_THRESHOLD 100 > + > + /* > + * Try to send a keepalive message after every CHANGES_THRESHOLD > + * changes. > + */ > + if (++changes_count >= CHANGES_THRESHOLD) > + { > + rb->update_progress_txn(rb, txn, change); > + changes_count = 0; > + } > > I noticed you put the #define adjacent to the only usage of it, > instead of with the other variable declaration like it was before. > Probably it is better how you have done it, but: > > 1a. > The comment indentation is incorrect. > > ~ > > 1b. > Since the #define is adjacent to its only usage IMO now the 2nd > comment is redundant. So the code can just say > > /* > * Sending keepalive messages after every change has some > overhead, but > * testing showed there is no noticeable overhead if > keepalive is only > * sent after every ~100 changes. > */ > #define CHANGES_THRESHOLD 100 > if (++changes_count >= CHANGES_THRESHOLD) > { > rb->update_progress_txn(rb, txn, change); > changes_count = 0; > } Changed as suggested. Attach the new patch. Regards, Wang Wei
Attachments
On Tue, Jan 24, 2023 at 1:45 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > On Tues, Jan 24, 2023 at 8:28 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi Hou-san, Here are my review comments for v5-0001. > > Thanks for your comments. ... > > Changed as suggested. > > Attach the new patch. Thanks! Patch v6 LGTM. ------ Kind Regards, Peter Smith. Fujitsu Australia
On Tue, Jan 24, 2023 at 8:15 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > Attach the new patch. > I think the patch missed handling the case of non-transactional messages, which was previously handled. I have tried to address that in the attached. Is there a reason that shouldn't be handled? Apart from that, I changed a few comments. If my understanding is correct, then we need to change the callback update_progress_txn name as well because now it needs to handle both transactional and non-transactional changes. How about update_progress_write? We accordingly need to change the comments for the callback. Additionally, I think we should have a test case to show we don't time out because of not processing non-transactional messages. See pgoutput_message for cases where it doesn't process the message. -- With Regards, Amit Kapila.
Attachments
On Wednesday, January 25, 2023 7:26 PM Amit Kapila <amit.kapila16@gmail.com> > > On Tue, Jan 24, 2023 at 8:15 AM wangw.fnst@fujitsu.com > <wangw.fnst@fujitsu.com> wrote: > > > > Attach the new patch. > > > > I think the patch missed to handle the case of non-transactional messages which > was previously getting handled. I have tried to address that in the attached. Is > there a reason that shouldn't be handled? Thanks for updating the patch! I thought about the non-transactional message case. I think it seems fine if we don't handle it for timeouts, because such messages are decoded via:

WalSndLoop
-XLogSendLogical
--LogicalDecodingProcessRecord
---logicalmsg_decode
----ReorderBufferQueueMessage
-----rb->message() -- // may send the message or do nothing here

After invoking rb->message(), we directly return to the main loop (WalSndLoop), where we get a chance to call WalSndKeepaliveIfNecessary() to avoid the timeout. This is a bit different from transactional changes, because those are buffered and then every buffered change is sent one by one (via ReorderBufferProcessTXN) without going back to WalSndLoop, so we don't get a chance to send a keepalive message when necessary, which is more likely to cause the timeout problem. I will also test the non-transactional message for timeout in case I missed something. Best Regards, Hou zj
On Fri, Jan 27, 2023 at 5:18 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > On Wednesday, January 25, 2023 7:26 PM Amit Kapila <amit.kapila16@gmail.com> > > > > On Tue, Jan 24, 2023 at 8:15 AM wangw.fnst@fujitsu.com > > <wangw.fnst@fujitsu.com> wrote: > > > > > > Attach the new patch. > > > > > > > I think the patch missed to handle the case of non-transactional messages which > > was previously getting handled. I have tried to address that in the attached. Is > > there a reason that shouldn't be handled? > > Thanks for updating the patch! > > I thought about the non-transactional message. I think it seems fine if we > don’t handle it for timeout because such message is decoded via: > > WalSndLoop > -XLogSendLogical > --LogicalDecodingProcessRecord > ---logicalmsg_decode > ----ReorderBufferQueueMessage > -----rb->message() -- //maybe send the message or do nothing here. > > After invoking rb->message(), we will directly return to the main > loop(WalSndLoop) where we will get a chance to call > WalSndKeepaliveIfNecessary() to avoid the timeout. > Valid point. But this means the previous handling of non-transactional messages was also redundant. > This is a bit different from transactional changes, because for transactional changes, we > will buffer them and then send every buffered change one by one(via > ReorderBufferProcessTXN) without going back to the WalSndLoop, so we don't get > a chance to send keepalive message if necessary, which is more likely to cause the > timeout problem. > > I will also test the non-transactional message for timeout in case I missed something. > Okay, thanks. Please see if we can test a mix of transactional and non-transactional messages as well. -- With Regards, Amit Kapila.
On Fri, Jan 27, 2023 at 19:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Jan 27, 2023 at 5:18 PM houzj.fnst@fujitsu.com > <houzj.fnst@fujitsu.com> wrote: > > > > On Wednesday, January 25, 2023 7:26 PM Amit Kapila > <amit.kapila16@gmail.com> > > > > > > On Tue, Jan 24, 2023 at 8:15 AM wangw.fnst@fujitsu.com > > > <wangw.fnst@fujitsu.com> wrote: > > > > > > > > Attach the new patch. > > > > > > > > > > I think the patch missed to handle the case of non-transactional messages > which > > > was previously getting handled. I have tried to address that in the attached. > Is > > > there a reason that shouldn't be handled? > > > > Thanks for updating the patch! > > > > I thought about the non-transactional message. I think it seems fine if we > > don’t handle it for timeout because such message is decoded via: > > > > WalSndLoop > > -XLogSendLogical > > --LogicalDecodingProcessRecord > > ---logicalmsg_decode > > ----ReorderBufferQueueMessage > > -----rb->message() -- //maybe send the message or do nothing here. > > > > After invoking rb->message(), we will directly return to the main > > loop(WalSndLoop) where we will get a chance to call > > WalSndKeepaliveIfNecessary() to avoid the timeout. > > > > Valid point. But this means the previous handling of non-transactional > messages was also redundant. Thanks for the analysis, I think it makes sense. So I removed the handling of non-transactional messages. > > This is a bit different from transactional changes, because for transactional > changes, we > > will buffer them and then send every buffered change one by one(via > > ReorderBufferProcessTXN) without going back to the WalSndLoop, so we > don't get > > a chance to send keepalive message if necessary, which is more likely to cause > the > > timeout problem. > > > > I will also test the non-transactional message for timeout in case I missed > something. > > > > Okay, thanks. 
Please see if we can test a mix of transactional and > non-transactional messages as well. I tested a mixed transaction of transactional and non-transactional messages on the current HEAD and reproduced the timeout problem. I think this result is OK, because when decoding a transaction, non-transactional changes are processed directly and the function WalSndKeepaliveIfNecessary is called, while transactional changes are cached and processed after decoding. After decoding, only transactional changes will be processed (in the function ReorderBufferProcessTXN), so the timeout problem will still be reproduced. After applying the v8 patch, the test mentioned above didn't reproduce the timeout problem (Attach this test script 'test_with_nontransactional.sh'). Attach the new patch. Regards, Wang Wei
Attachments
On Sun, Jan 29, 2023 3:41 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > I tested a mix transaction of transactional and non-transactional messages on > the current HEAD and reproduced the timeout problem. I think this result is OK. > Because when decoding a transaction, non-transactional changes are processed > directly and the function WalSndKeepaliveIfNecessary is called, while > transactional changes are cached and processed after decoding. After decoding, > only transactional changes will be processed (in the function > ReorderBufferProcessTXN), so the timeout problem will still be reproduced. > > After applying the v8 patch, the test mentioned above didn't reproduce the > timeout problem (Attach this test script 'test_with_nontransactional.sh'). > > Attach the new patch. > Thanks for updating the patch. Here is a comment. In update_progress_txn_cb_wrapper(), it looks like we need to reset changes_count to 0 after calling OutputPluginUpdateProgress(); otherwise, once 100 changes have been processed, OutputPluginUpdateProgress() will be called after every subsequent change. Regards, Shi yu
On Mon, Jan 30, 2023 11:37 AM Shi, Yu/侍 雨 <shiy.fnst@cn.fujitsu.com> wrote: > On Sun, Jan 29, 2023 3:41 PM wangw.fnst@fujitsu.com > <wangw.fnst@fujitsu.com> wrote: > > > > I tested a mix transaction of transactional and non-transactional messages on > > the current HEAD and reproduced the timeout problem. I think this result is > OK. > > Because when decoding a transaction, non-transactional changes are > processed > > directly and the function WalSndKeepaliveIfNecessary is called, while > > transactional changes are cached and processed after decoding. After > decoding, > > only transactional changes will be processed (in the function > > ReorderBufferProcessTXN), so the timeout problem will still be reproduced. > > > > After applying the v8 patch, the test mentioned above didn't reproduce the > > timeout problem (Attach this test script 'test_with_nontransactional.sh'). > > > > Attach the new patch. > > > > Thanks for updating the patch. Here is a comment. Thanks for your comment. > In update_progress_txn_cb_wrapper(), it looks like we need to reset > changes_count to 0 after calling OutputPluginUpdateProgress(), otherwise > OutputPluginUpdateProgress() will always be called after 100 changes. Yes, I think you are right. Fixed this problem. Attach the new patch. Regards, Wang Wei
Attachments
On Mon, Jan 30, 2023 at 10:36 AM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > On Mon, Jan 30, 2023 11:37 AM Shi, Yu/侍 雨 <shiy.fnst@cn.fujitsu.com> wrote: > > On Sun, Jan 29, 2023 3:41 PM wangw.fnst@fujitsu.com > > <wangw.fnst@fujitsu.com> wrote: > > Yes, I think you are right. > Fixed this problem. > + /* + * Trying to send keepalive message after every change has some + * overhead, but testing showed there is no noticeable overhead if + * we do it after every ~100 changes. + */ +#define CHANGES_THRESHOLD 100 + + if (++changes_count < CHANGES_THRESHOLD) + return; ... + changes_count = 0; I think it is better to have this threshold-related code in that caller as we have in the previous version. Also, let's modify the comment as follows:" It is possible that the data is not sent to downstream for a long time either because the output plugin filtered it or there is a DDL that generates a lot of data that is not processed by the plugin. So, in such cases, the downstream can timeout. To avoid that we try to send a keepalive message if required. Trying to send a keepalive message after every change has some overhead, but testing showed there is no noticeable overhead if we do it after every ~100 changes." -- With Regards, Amit Kapila.
On Mon, Jan 30, 2023 at 14:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Jan 30, 2023 at 10:36 AM wangw.fnst@fujitsu.com > <wangw.fnst@fujitsu.com> wrote: > > > > On Mon, Jan 30, 2023 11:37 AM Shi, Yu/侍 雨 <shiy.fnst@cn.fujitsu.com> > wrote: > > > On Sun, Jan 29, 2023 3:41 PM wangw.fnst@fujitsu.com > > > <wangw.fnst@fujitsu.com> wrote: > > > > Yes, I think you are right. > > Fixed this problem. > > > > + /* > + * Trying to send keepalive message after every change has some > + * overhead, but testing showed there is no noticeable overhead if > + * we do it after every ~100 changes. > + */ > +#define CHANGES_THRESHOLD 100 > + > + if (++changes_count < CHANGES_THRESHOLD) > + return; > ... > + changes_count = 0; > > I think it is better to have this threshold-related code in that > caller as we have in the previous version. Also, let's modify the > comment as follows:" > It is possible that the data is not sent to downstream for a long time > either because the output plugin filtered it or there is a DDL that > generates a lot of data that is not processed by the plugin. So, in > such cases, the downstream can timeout. To avoid that we try to send a > keepalive message if required. Trying to send a keepalive message > after every change has some overhead, but testing showed there is no > noticeable overhead if we do it after every ~100 changes." Changed as suggested. I also removed the comment atop the function update_progress_txn_cb_wrapper to be consistent with the nearby *_cb_wrapper functions. Attach the new patch. Regards, Wang Wei
Attachments
On Mon, Jan 30, 2023 at 17:50 PM I wrote: > Attach the new patch. When invoking the function ReorderBufferProcessTXN, the threshold-related counter "changes_count" may retain a stale value from the previous transaction's processing. To fix this, I moved the definition of the counter "changes_count" outside the while-loop and did not use the keyword "static". Attach the new patch. Regards, Wang Wei
Attachments
On Tue, Jan 31, 2023 at 2:53 PM wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote: > > On Mon, Jan 30, 2023 at 17:50 PM I wrote: > > Attach the new patch. > > When invoking the function ReorderBufferProcessTXN, the threshold-related > counter "changes_count" may have some random value from the previous > transaction's processing. To fix this, I moved the definition of the counter > "changes_count" outside the while-loop and did not use the keyword "static". > > Attach the new patch. > Thanks, the patch looks good to me. I have slightly adjusted one of the comments and ran pgindent. See attached. As mentioned in the commit message, we shouldn't backpatch this as this requires a new callback and moreover, users can increase the wal_sender_timeout and wal_receiver_timeout to avoid this problem. What do you think? -- With Regards, Amit Kapila.
Attachments
On Tue, Jan 31, 2023 at 4:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Thanks, the patch looks good to me. I have slightly adjusted one of > the comments and ran pgindent. See attached. As mentioned in the > commit message, we shouldn't backpatch this as this requires a new > callback and moreover, users can increase the wal_sender_timeout and > wal_receiver_timeout to avoid this problem. What do you think? The callback and the implementation are all in core. What's the risk you see in backpatching it? Customers can adjust the timeouts, but only after the receiver has timed out a few times. Replication remains broken till they notice it and adjust timeouts. By that time WAL has piled up. It also takes a few attempts to increase timeouts since the time taken by a transaction to decode cannot be estimated beforehand. All that makes it worth back-patching if it's possible. We had a customer who piled up GBs of WAL before realising that this is the problem. Their system almost came to a halt due to that. -- Best Wishes, Ashutosh Bapat
On Tue, Jan 31, 2023 at 5:03 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Tue, Jan 31, 2023 at 4:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Thanks, the patch looks good to me. I have slightly adjusted one of > > the comments and ran pgindent. See attached. As mentioned in the > > commit message, we shouldn't backpatch this as this requires a new > > callback and moreover, users can increase the wal_sender_timeout and > > wal_receiver_timeout to avoid this problem. What do you think? > > The callback and the implementation is all in core. What's the risk > you see in backpatching it? > Because we are changing the exposed structure and which can break existing extensions using it. > Customers can adjust the timeouts, but only after the receiver has > timed out a few times. Replication remains broekn till they notice it > and adjust timeouts. By that time WAL has piled up. It also takes a > few attempts to increase timeouts since the time taken by a > transaction to decode can not be estimated beforehand. All that makes > it worth back-patching if it's possible. We had a customer who piled > up GBs of WAL before realising that this is the problem. Their system > almost came to a halt due to that. > Which version are they using? If they are at >=14, using "streaming = on" for a subscription should also avoid this problem. -- With Regards, Amit Kapila.
On Tue, Jan 31, 2023 at 5:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 31, 2023 at 5:03 PM Ashutosh Bapat > <ashutosh.bapat.oss@gmail.com> wrote: > > > > On Tue, Jan 31, 2023 at 4:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > Thanks, the patch looks good to me. I have slightly adjusted one of > > > the comments and ran pgindent. See attached. As mentioned in the > > > commit message, we shouldn't backpatch this as this requires a new > > > callback and moreover, users can increase the wal_sender_timeout and > > > wal_receiver_timeout to avoid this problem. What do you think? > > > > The callback and the implementation is all in core. What's the risk > > you see in backpatching it? > > > > Because we are changing the exposed structure and which can break > existing extensions using it. Is that because we are adding the new member in the middle of the structure? Shouldn't extensions provide new libraries with each maintenance release of PG? > > > Customers can adjust the timeouts, but only after the receiver has > > timed out a few times. Replication remains broekn till they notice it > > and adjust timeouts. By that time WAL has piled up. It also takes a > > few attempts to increase timeouts since the time taken by a > > transaction to decode can not be estimated beforehand. All that makes > > it worth back-patching if it's possible. We had a customer who piled > > up GBs of WAL before realising that this is the problem. Their system > > almost came to a halt due to that. > > > > Which version are they using? If they are at >=14, using "streaming = > on" for a subscription should also avoid this problem. 13. -- Best Wishes, Ashutosh Bapat
Here are my review comments for v13-00001. ====== Commit message 1. The DDLs like Refresh Materialized views that generate lots of temporary data due to rewrite rules may not be processed by output plugins (for example pgoutput). So, we won't send keep-alive messages for a long time while processing such commands and that can lead the subscriber side to timeout. ~ SUGGESTION (minor rearranged way to say the same thing) For DDLs that generate lots of temporary data due to rewrite rules (e.g. REFRESH MATERIALIZED VIEW) the output plugins (e.g. pgoutput) may not be processed for a long time. Since we don't send keep-alive messages while processing such commands that can lead the subscriber side to timeout. ~~~ 2. The commit message says what the problem is, but it doesn’t seem to describe what this patch does to fix the problem. ====== src/backend/replication/logical/reorderbuffer.c 3. + /* + * It is possible that the data is not sent to downstream for a + * long time either because the output plugin filtered it or there + * is a DDL that generates a lot of data that is not processed by + * the plugin. So, in such cases, the downstream can timeout. To + * avoid that we try to send a keepalive message if required. + * Trying to send a keepalive message after every change has some + * overhead, but testing showed there is no noticeable overhead if + * we do it after every ~100 changes. + */ 3a. "data is not sent to downstream" --> "data is not sent downstream" (?) ~ 3b. "So, in such cases," --> "In such cases," ~~~ 4. +#define CHANGES_THRESHOLD 100 + + if (++changes_count >= CHANGES_THRESHOLD) + { + rb->update_progress_txn(rb, txn, change->lsn); + changes_count = 0; + } I was wondering if it would have been simpler to write this code like below. 
Also, by doing it this way the 'changes_count' variable name makes more sense IMO, otherwise (for current code) maybe it should be called something like 'changes_since_last_keepalive' SUGGESTION if (++changes_count % CHANGES_THRESHOLD == 0) rb->update_progress_txn(rb, txn, change->lsn); ------ Kind Regards, Peter Smith. Fujitsu Australia
On Wed, Feb 1, 2023 at 4:43 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are my review comments for v13-00001. > > ====== > Commit message > > 1. > The DDLs like Refresh Materialized views that generate lots of temporary > data due to rewrite rules may not be processed by output plugins (for > example pgoutput). So, we won't send keep-alive messages for a long time > while processing such commands and that can lead the subscriber side to > timeout. > > ~ > > SUGGESTION (minor rearranged way to say the same thing) > > For DDLs that generate lots of temporary data due to rewrite rules > (e.g. REFRESH MATERIALIZED VIEW) the output plugins (e.g. pgoutput) > may not be processed for a long time. Since we don't send keep-alive > messages while processing such commands that can lead the subscriber > side to timeout. > Hmm, this makes it less clear and in fact changed the meaning. > ~~~ > > 2. > The commit message says what the problem is, but it doesn’t seem to > describe what this patch does to fix the problem. > I thought it was apparent and the code comments made it clear. > > 4. > +#define CHANGES_THRESHOLD 100 > + > + if (++changes_count >= CHANGES_THRESHOLD) > + { > + rb->update_progress_txn(rb, txn, change->lsn); > + changes_count = 0; > + } > > I was wondering if it would have been simpler to write this code like below. > > Also, by doing it this way the 'changes_count' variable name makes > more sense IMO, otherwise (for current code) maybe it should be called > something like 'changes_since_last_keepalive' > > SUGGESTION > if (++changes_count % CHANGES_THRESHOLD == 0) > rb->update_progress_txn(rb, txn, change->lsn); > I find the current code in the patch clear and easy to understand. -- With Regards, Amit Kapila.
On Tue, Jan 31, 2023 at 8:24 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Tue, Jan 31, 2023 at 5:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Jan 31, 2023 at 5:03 PM Ashutosh Bapat > > <ashutosh.bapat.oss@gmail.com> wrote: > > > > > > On Tue, Jan 31, 2023 at 4:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > Thanks, the patch looks good to me. I have slightly adjusted one of > > > > the comments and ran pgindent. See attached. As mentioned in the > > > > commit message, we shouldn't backpatch this as this requires a new > > > > callback and moreover, users can increase the wal_sender_timeout and > > > > wal_receiver_timeout to avoid this problem. What do you think? > > > > > > The callback and the implementation is all in core. What's the risk > > > you see in backpatching it? > > > > > > > Because we are changing the exposed structure and which can break > > existing extensions using it. > > Is that because we are adding the new member in the middle of the > structure? > Not only that but this changes the size of the structure and we want to avoid that as well in stable branches. See email [1] (you can't change the struct size either ...). As per my understanding, our usual practice is to not change the exposed structure's size/definition in stable branches. [1] - https://www.postgresql.org/message-id/2358496.1649168259%40sss.pgh.pa.us -- With Regards, Amit Kapila.
On Wed, Feb 1, 2023 at 10:04 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 31, 2023 at 8:24 PM Ashutosh Bapat > <ashutosh.bapat.oss@gmail.com> wrote: > > > > On Tue, Jan 31, 2023 at 5:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Tue, Jan 31, 2023 at 5:03 PM Ashutosh Bapat > > > <ashutosh.bapat.oss@gmail.com> wrote: > > > > > > > > On Tue, Jan 31, 2023 at 4:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Thanks, the patch looks good to me. I have slightly adjusted one of > > > > > the comments and ran pgindent. See attached. As mentioned in the > > > > > commit message, we shouldn't backpatch this as this requires a new > > > > > callback and moreover, users can increase the wal_sender_timeout and > > > > > wal_receiver_timeout to avoid this problem. What do you think? > > > > > > > > The callback and the implementation is all in core. What's the risk > > > > you see in backpatching it? > > > > > > > > > > Because we are changing the exposed structure and which can break > > > existing extensions using it. > > > > Is that because we are adding the new member in the middle of the > > structure? > > > > Not only that but this changes the size of the structure and we want > to avoid that as well in stable branches. See email [1] (you can't > change the struct size either ...). As per my understanding, our usual > practice is to not change the exposed structure's size/definition in > stable branches. > > I am planning to push this to HEAD sometime next week (by Wednesday). To backpatch this, we need to fix it in some non-standard way, like without introducing a callback which I am not sure is a good idea. If some other committers vote to get this in back branches with that or some different idea that can be backpatched then we can do that separately as well. I don't see this as a must-fix in back branches because we have a workaround (increase timeout) or users can use the streaming option (for >=14). 
-- With Regards, Amit Kapila.
Hi, On 2023-02-03 10:13:54 +0530, Amit Kapila wrote: > I am planning to push this to HEAD sometime next week (by Wednesday). > To backpatch this, we need to fix it in some non-standard way, like > without introducing a callback which I am not sure is a good idea. If > some other committers vote to get this in back branches with that or > some different idea that can be backpatched then we can do that > separately as well. I don't see this as a must-fix in back branches > because we have a workaround (increase timeout) or users can use the > streaming option (for >=14). I just saw the commit go in, and a quick scan over it makes me think neither this commit, nor f95d53eded, which unfortunately was already backpatched, is the right direction. The wrong direction likely started quite a bit earlier, with 024711bb544. It feels quite fundamentally wrong that basically every output plugin needs to call a special function in nearly every callback. In 024711bb544 there was just one call to OutputPluginUpdateProgress() in pgoutput.c. Quite tellingly, it just updated pgoutput, without touching test_decoding. Then a8fd13cab0b added two more calls. 63cf61cdeb7 yet another. This makes no sense. There's lots of output plugins out there. There's an increasing number of callbacks. This isn't a maintainable path forward. If we want to call something to maintain state, it has to be happening from central infrastructure. It feels quite odd architecturally that WalSndUpdateProgress() ends up flushing out writes - that's far far from obvious. I don't think: /* * Wait until there is no pending write. Also process replies from the other * side and check timeouts during that. */ static void ProcessPendingWrites(void) Is really a good name. What are we processing? What are we actually waiting for - because we don't actually wait for the data to be sent out or anything, just that they're in a network buffer. Greetings, Andres Freund
On Wed, Feb 8, 2023 at 10:57 AM Andres Freund <andres@anarazel.de> wrote: > > On 2023-02-03 10:13:54 +0530, Amit Kapila wrote: > > I am planning to push this to HEAD sometime next week (by Wednesday). > > To backpatch this, we need to fix it in some non-standard way, like > > without introducing a callback which I am not sure is a good idea. If > > some other committers vote to get this in back branches with that or > > some different idea that can be backpatched then we can do that > > separately as well. I don't see this as a must-fix in back branches > > because we have a workaround (increase timeout) or users can use the > > streaming option (for >=14). > > I just saw the commit go in, and a quick scan over it makes me think neither > this commit, nor f95d53eded, which unfortunately was already backpatched, is > the right direction. The wrong direction likely started quite a bit earlier, > with 024711bb544. > > It feels quite fundamentally wrong that bascially every output plugin needs to > call a special function in nearly every callback. > > In 024711bb544 there was just one call to OutputPluginUpdateProgress() in > pgoutput.c. Quite tellingly, it just updated pgoutput, without touching > test_decoding. > > Then a8fd13cab0b added to more calls. 63cf61cdeb7 yet another. > I think the original commit 024711bb544 forgets to call it in test_decoding and the other commits followed the same and missed to update test_decoding. > > This makes no sense. There's lots of output plugins out there. There's an > increasing number of callbacks. This isn't a maintainable path forward. > > > If we want to call something to maintain state, it has to be happening from > central infrastructure. > > > It feels quite odd architecturally that WalSndUpdateProgress() ends up > flushing out writes - that's far far from obvious. > > I don't think: > /* > * Wait until there is no pending write. Also process replies from the other > * side and check timeouts during that. 
> */ > static void > ProcessPendingWrites(void) > > Is really a good name. What are we processing? > It is for sending the keep_alive message (if required). That is normally done when we skipped processing a transaction to ensure sync replication is not delayed. It has been discussed previously [1][2] to extend the WalSndUpdateProgress() interface. Basically, as explained by Craig [2], this has to be done from plugin as it can do filtering or there could be other reasons why the output plugin skips all changes. We used the same interface for sending keep-alive message when we processed a lot of (DDL) changes without sending anything to plugin. [1] - https://www.postgresql.org/message-id/20200309183018.tzkzwu635sd366ej%40alap3.anarazel.de [2] - https://www.postgresql.org/message-id/CAMsr%2BYE3o8Dt890Q8wTooY2MpN0JvdHqUAHYL-LNhBryXOPaKg%40mail.gmail.com -- With Regards, Amit Kapila.
Hi, On 2023-02-08 13:36:02 +0530, Amit Kapila wrote: > On Wed, Feb 8, 2023 at 10:57 AM Andres Freund <andres@anarazel.de> wrote: > > > > On 2023-02-03 10:13:54 +0530, Amit Kapila wrote: > > > I am planning to push this to HEAD sometime next week (by Wednesday). > > > To backpatch this, we need to fix it in some non-standard way, like > > > without introducing a callback which I am not sure is a good idea. If > > > some other committers vote to get this in back branches with that or > > > some different idea that can be backpatched then we can do that > > > separately as well. I don't see this as a must-fix in back branches > > > because we have a workaround (increase timeout) or users can use the > > > streaming option (for >=14). > > > > I just saw the commit go in, and a quick scan over it makes me think neither > > this commit, nor f95d53eded, which unfortunately was already backpatched, is > > the right direction. The wrong direction likely started quite a bit earlier, > > with 024711bb544. > > > > It feels quite fundamentally wrong that bascially every output plugin needs to > > call a special function in nearly every callback. > > > > In 024711bb544 there was just one call to OutputPluginUpdateProgress() in > > pgoutput.c. Quite tellingly, it just updated pgoutput, without touching > > test_decoding. > > > > Then a8fd13cab0b added to more calls. 63cf61cdeb7 yet another. > > > > I think the original commit 024711bb544 forgets to call it in > test_decoding and the other commits followed the same and missed to > update test_decoding. I think that's a symptom of the wrong architecture having been chosen. This should *never* have been the task of output plugins. > > I don't think: > > /* > > * Wait until there is no pending write. Also process replies from the other > > * side and check timeouts during that. > > */ > > static void > > ProcessPendingWrites(void) > > > > Is really a good name. What are we processing? 
> > > > It is for sending the keep_alive message (if required). That is > normally done when we skipped processing a transaction to ensure sync > replication is not delayed. But how is that "processing pending writes"? For me "processing" implies we're doing some analysis on them or such. If we want to write data in WalSndUpdateProgress(), shouldn't we move the common code of WalSndWriteData() and WalSndUpdateProgress() into ProcessPendingWrites()? > It has been discussed previously [1][2] to > extend the WalSndUpdateProgress() interface. Basically, as explained > by Craig [2], this has to be done from plugin as it can do filtering > or there could be other reasons why the output plugin skips all > changes. We used the same interface for sending keep-alive message > when we processed a lot of (DDL) changes without sending anything to > plugin. > > [1] - https://www.postgresql.org/message-id/20200309183018.tzkzwu635sd366ej%40alap3.anarazel.de > [2] - https://www.postgresql.org/message-id/CAMsr%2BYE3o8Dt890Q8wTooY2MpN0JvdHqUAHYL-LNhBryXOPaKg%40mail.gmail.com I don't buy that this has to be done by the output plugin. The actual sending out of data happens via the LogicalDecodingContext callbacks, so we very well can know whether we recently did send out data or not. This really is a concern of the LogicalDecodingContext, it has pretty much nothing to do with output plugins. We should remove all calls of OutputPluginUpdateProgress() from pgoutput, and add the necessary calls to LogicalDecodingContext->update_progress() to generic code. And Additionally we should either rename WalSndUpdateProgress(), because it's now doing *far* more than "updating progress", or alternatively, split it into two functions. I don't think the syncrep logic in WalSndUpdateProgress really works as-is - consider what happens if e.g. the origin filter filters out entire transactions. We'll afaics never get to WalSndUpdateProgress(). 
In some cases we'll be lucky because we'll return quickly to XLogSendLogical(), but not reliably. Greetings, Andres Freund
Hi, On 2023-02-08 10:18:41 -0800, Andres Freund wrote: > I don't think the syncrep logic in WalSndUpdateProgress really works as-is - > consider what happens if e.g. the origin filter filters out entire > transactions. We'll afaics never get to WalSndUpdateProgress(). In some cases > we'll be lucky because we'll return quickly to XLogSendLogical(), but not > reliably. Is it actually the right thing to check SyncRepRequested() in that logic? It's quite common to set up syncrep so that individual users or transactions opt into syncrep, but to leave the default disabled. I don't really see an alternative to making this depend solely on sync_standbys_defined. Greetings, Andres Freund
Hi, On 2023-02-08 10:30:37 -0800, Andres Freund wrote: > On 2023-02-08 10:18:41 -0800, Andres Freund wrote: > > I don't think the syncrep logic in WalSndUpdateProgress really works as-is - > > consider what happens if e.g. the origin filter filters out entire > > transactions. We'll afaics never get to WalSndUpdateProgress(). In some cases > > we'll be lucky because we'll return quickly to XLogSendLogical(), but not > > reliably. > > Is it actually the right thing to check SyncRepRequested() in that logic? It's > quite common to set up syncrep so that individual users or transactions opt > into syncrep, but to leave the default disabled. > > I don't really see an alternative to making this depend solely on > sync_standbys_defined. Hacking on a rough prototype how I think this should rather look, I had a few questions / remarks: - We probably need to call UpdateProgress from a bunch of places in decode.c as well? Indicating that we're lagging by a lot, just because all transactions were in another database seems decidedly suboptimal. - Why should lag tracking only be updated at commit-like points? That seems like it adds odd discontinuities? - The mix of skipped_xact and ctx->end_xact in WalSndUpdateProgress() seems somewhat odd. They have very overlapping meanings IMO. - There are no UpdateProgress calls in pgoutput_stream_abort(), but ISTM there should be? It's legit progress. - That's from 6912acc04f0: I find LagTrackerRead(), LagTrackerWrite() quite confusing, naming-wise. IIUC "reading" is about receiving confirmation messages, "writing" about the time the record was generated. ISTM that the current time is a quite poor approximation in XLogSendPhysical(), but pretty much meaningless in WalSndUpdateProgress()? Am I missing something? - Aren't the wal_sender_timeout / 2 checks in WalSndUpdateProgress(), WalSndWriteData() missing wal_sender_timeout <= 0 checks?
- I don't really understand why f95d53eded added !end_xact to the if condition for ProcessPendingWrites(). Is the theory that we'll end up in an outer loop soon? Attached is a current, quite rough, prototype. It addresses some of the points raised, but far from all. There's also several XXXs/FIXMEs in it. I changed the file-ending to .txt to avoid hijacking the CF entry. Greetings, Andres Freund
On Thu, Feb 9, 2023 at 1:33 AM Andres Freund <andres@anarazel.de> wrote: > > Hacking on a rough prototype how I think this should rather look, I had a few > questions / remarks: > > - We probably need to call UpdateProgress from a bunch of places in decode.c > as well? Indicating that we're lagging by a lot, just because all > transactions were in another database seems decidedly suboptimal. > We can do that but I think in all those cases we will reach quickly enough back to walsender logic (WalSndLoop - that will send keepalive if required) that we don't need to worry. After processing each record, the logic will return back to the main loop that will send keepalive if required. Also, while reading WAL if we need to block, it will call WalSndWaitForWal() which will send keepalive if required. The real problem we have seen in the field reports or tests is that when we process a large transaction where changes are queued in the reorderbuffer and while processing those we discard all or most of the changes. The patch calls update_progress in change_cb_wrapper and other wrappers which will miss the case of DDLs that generates a lot of data that is not processed by the plugin. I think for that we either need to call update_progress from reorderbuffer.c similar to what the patch has removed or we need some other way to address it. Do you have any better idea? > - Why should lag tracking only be updated at commit like points? That seems > like it adds odd discontinuinities? > We have previously experimented to call it from non-commit locations but that turned out to give inaccurate information about Lag. See email [1]. > - The mix of skipped_xact and ctx->end_xact in WalSndUpdateProgress() seems > somewhat odd. They have very overlapping meanings IMO. > > - there's no UpdateProgress calls in pgoutput_stream_abort(), but ISTM there > should be? It's legit progress. > Agreed with both of the above points. 
> - That's from 6912acc04f0: I find LagTrackerRead(), LagTrackerWrite() quite > confusing, naming-wise. IIUC "reading" is about receiving confirmation > messages, "writing" about the time the record was generated. ISTM that the > current time is a quite poor approximation in XLogSendPhysical(), but pretty > much meaningless in WalSndUpdateProgress()? Am I missing something? > Leaving it for Thomas to answer. > - Aren't the wal_sender_timeout / 2 checks in WalSndUpdateProgress(), > WalSndWriteData() missing wal_sender_timeout <= 0 checks? > It seems we are checking that via ProcessPendingWrites()->WalSndKeepaliveIfNecessary(). Do you think we need to check it before as well? > - I don't really understand why f95d53edged55 added !end_xact to the if > condition for ProcessPendingWrites(). Is the theory that we'll end up in an > outer loop soon? > Yes. For non-empty xacts, we will anyway send a commit message. For empty (skipped) xacts, we will send for synchronous replication case to avoid any delay. > > Attached is a current, quite rough, prototype. It addresses some of the points > raised, but far from all. There's also several XXXs/FIXMEs in it. I changed > the file-ending to .txt to avoid hijacking the CF entry. > I have started a separate thread to avoid such confusion. I hope that is fine with you. > > > I don't think the syncrep logic in WalSndUpdateProgress really works as-is - > > > consider what happens if e.g. the origin filter filters out entire > > > transactions. We'll afaics never get to WalSndUpdateProgress(). In some cases > > > we'll be lucky because we'll return quickly to XLogSendLogical(), but not > > > reliably. > > Which case are you worried about? As mentioned in one of the previous points I thought the timeout/keepalive handling in the callers should be enough. > > Is it actually the right thing to check SyncRepRequested() in that logic? 
It's > > quite common to set up syncrep so that individual users or transactions opt > > into syncrep, but to leave the default disabled. > > > > I don't really see an alternative to making this depend solely on > > sync_standbys_defined. Fair point. How about renaming ProcessPendingWrites to WaitToSendPendingWrites or WalSndWaitToSendPendingWrites? [1] - https://www.postgresql.org/message-id/OS3PR01MB62755D216245199554DDC8DB9EEA9%40OS3PR01MB6275.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
On Thu, Feb 9, 2023 at 11:21 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > How about renaming ProcessPendingWrites to WaitToSendPendingWrites or > WalSndWaitToSendPendingWrites? > How about renaming WalSndUpdateProgress() to WalSndUpdateProgressAndSendKeepAlive() or WalSndUpdateProgressAndKeepAlive()? One thing to note about the changes we are discussing here is that some of the plugins like wal2json already call OutputPluginUpdateProgress in their commit callback. They may need to update it accordingly. One difference I see with the patch is that I think we will end up sending keepalive for empty prepared transactions even though we don't skip sending begin/prepare messages for those. The reason why we don't skip sending prepare for empty 2PC xacts is that if the WALSender restarts after the PREPARE of a transaction and before the COMMIT PREPARED of the same transaction then we won't be able to figure out if we have skipped sending BEGIN/PREPARE of a transaction. To skip sending prepare for empty xacts, we previously thought of some ideas like (a) At commit-prepare time have a check on the subscriber-side to know whether there is a corresponding prepare for it before actually doing commit-prepare but that sounded costly. (b) somehow persist the information whether the PREPARE for a xact is already sent and then use that information for commit prepared but again that also didn't sound like a good idea. -- With Regards, Amit Kapila.
On Wed, 8 Feb 2023 at 15:04, Andres Freund <andres@anarazel.de> wrote: > > Attached is a current, quite rough, prototype. It addresses some of the points > raised, but far from all. There's also several XXXs/FIXMEs in it. I changed > the file-ending to .txt to avoid hijacking the CF entry. It looks like this patch has received quite a generous helping of feedback from Andres. I'm setting it to Waiting on Author. On the one hand it looks like there's a lot of work to do on this but on the other hand it sounds like this is a live problem in the field so if it can get done in time for release that would be great but if not then feel free to move it to the next commitfest (which means next release). -- Gregory Stark As Commitfest Manager
On Thu, Mar 2, 2023 at 4:19 AM Gregory Stark (as CFM) <stark.cfm@gmail.com> wrote: > On Wed, 8 Feb 2023 at 15:04, Andres Freund <andres@anarazel.de> wrote: > > > > Attached is a current, quite rough, prototype. It addresses some of the points > > raised, but far from all. There's also several XXXs/FIXMEs in it. I changed > > the file-ending to .txt to avoid hijacking the CF entry. > > It looks like this patch has received quite a generous helping of > feedback from Andres. I'm setting it to Waiting on Author. > > On the one hand it looks like there's a lot of work to do on this but > on the other hand it sounds like this is a live problem in the field > so if it can get done in time for release that would be great but if > not then feel free to move it to the next commitfest (which means next > release). Hi, Since this patch is an improvement to the architecture in HEAD, we started another new thread [1] on this topic to develop the related patch. It seems that we could modify the details of this CF entry to point to the new thread and change the status to 'Needs Review'. [1] - https://www.postgresql.org/message-id/20230210210423.r26ndnfmuifie4f6%40awork3.anarazel.de Regards, Wang Wei