RE: Logical replication timeout problem

Поиск
Список
Период
Сортировка
От wangw.fnst@fujitsu.com
Тема RE: Logical replication timeout problem
Дата
Msg-id OS3PR01MB62751F9FC898C73E638E36489E5B9@OS3PR01MB6275.jpnprd01.prod.outlook.com
обсуждение исходный текст
Ответ на Re: Logical replication timeout problem  (Amit Kapila <amit.kapila16@gmail.com>)
Ответы Re: Logical replication timeout problem  (Fabrice Chapuis <fabrice636861@gmail.com>)
Список pgsql-hackers
On Thu, Jan 20, 2022 at 9:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> It might be not reaching the actual send_keep_alive logic in
> WalSndKeepaliveIfNecessary because of below code:
> {
> ...
> /*
> * Don't send keepalive messages if timeouts are globally disabled or
> * we're doing something not partaking in timeouts.
> */
> if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0) return; ..
> }
> 
> I think you can add elog before the above return and before updating progress
> in the below code:
> case REORDER_BUFFER_CHANGE_INSERT:
>   if (!relentry->pubactions.pubinsert)
> + {
> + OutputPluginUpdateProgress(ctx);
>   return;
> 
> This will help us to rule out one possibility.

Thanks for your advices!

According to your advices, I applied 0001,0002 and 0003 to run the test script.
When subscriber timeout, I filter publisher-side log:
$ grep "before invoking update progress" pub.log | wc -l
60373557
$ grep "return because wal_sender_timeout or last_reply_timestamp" pub.log | wc -l
0
$ grep "return because waiting_for_ping_response" pub.log | wc -l
0

Based on this result, I think function WalSndKeepaliveIfNecessary was invoked,
but function WalSndKeepalive was not invoked because (last_processing >=
ping_time) is false.
So I tried to see changes about last_processing and last_reply_timestamp
(because ping_time is based on last_reply_timestamp).

I found last_processing and last_reply_timestamp is set in function
ProcessRepliesIfAny.
last_processing is set to the time when function ProcessRepliesIfAny is
invoked.
Only when publisher receive a response from subscriber, last_reply_timestamp is
set to last_processing and the flag waiting_for_ping_response is reset to
false.

When we are during the loop to skip all the changes of transaction, IIUC, we do
not invoke function ProcessRepliesIfAny. So I think last_processing and
last_reply_timestamp will not be changed in this loop.
Therefore I think about our use case, we should modify the condition of
invoking WalSndKeepalive.(please refer to
0004-Simple-modification-of-timing.patch, and note that this is only a patch
for testing).
At the same time I modify the input of WalSndKeepalive from true to false. This
is because when input is true, waiting_for_ping_response is set to true in
WalSndKeepalive. As mentioned above, ProcessRepliesIfAny is not invoked in the
loop, so I think waiting_for_ping_response will not be reset to false and
keepalive messages will not be sent.

I tested after applying patches(0001 and 0004), I found the timeout was not
printed in subscriber-side log. And the added messages "begin load changes" and
"commit the log" were printed in publisher-side log:
$ grep -ir "begin load changes" pub.log
2022-01-21 11:17:06.934 CST [2577699] LOG:  begin load changes
$ grep -ir "commit the log" pub.log
2022-01-21 11:21:15.564 CST [2577699] LOG:  commit the log

Attach the patches and test script mentioned above, in case someone wants to
try.

Regards,
Wang wei

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Eisentraut
Дата:
Сообщение: Re: ICU for global collation
Следующее
От: Dilip Kumar
Дата:
Сообщение: Re: row filtering for logical replication