Re: Slow catchup of 2PC (twophase) transactions on replica in LR

From: Ajin Cherian
Subject: Re: Slow catchup of 2PC (twophase) transactions on replica in LR
Msg-id: CAFPTHDa=pJSZ_4dV5DPAOapRSgPcyyUTP0WzGY2Rz_D3-gwraw@mail.gmail.com
In reply to: Re: Slow catchup of 2PC (twophase) transactions on replica in LR  (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: Slow catchup of 2PC (twophase) transactions on replica in LR  (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers


On Thu, Apr 4, 2024 at 4:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> I think this would probably be better than the current situation but
> can we think of a solution to allow toggling the value of two_phase
> even when prepared transactions are present? Can you please summarize
> the reason for the problems in doing that and the solutions, if any?
>
> --
> With Regards,
> Amit Kapila.

I have updated the patch, as the previous version did not update two_phase in the remote slot.

Currently, the main issue that needs to be handled is pending prepared transactions while two_phase is being altered. I see 3 issues with the current approach.

1. Uncommitted prepared transactions when toggling two_phase from true to false
  When two_phase was true, prepared transactions were decoded at PREPARE time and sent to the subscriber, where they were prepared with a new gid. Once two_phase is toggled to false, the COMMIT PREPARED on the publisher is converted to a commit, and the entire transaction is decoded and sent to the subscriber. This leaves the previously prepared transaction pending on the subscriber.

2. Uncommitted prepared transactions when toggling two_phase from false to true
  When two_phase was false, prepared transactions were ignored and not decoded at PREPARE time on the publisher. Once two_phase is toggled to true, the apply worker and the walsender are restarted, and replication is restarted from a new "start_decoding_at" LSN. This new "start_decoding_at" could be past the LSN of the PREPARE record; if so, the PREPARE record is skipped and not sent to the subscriber. See the comments in DecodeTXNNeedSkip() for details. Later, when the user issues COMMIT PREPARED, it is decoded and sent to the subscriber, but there is no prepared transaction on the subscriber, so this fails because the corresponding gid of the transaction cannot be found.

3. While altering two_phase of the subscription, it is also required to alter the two_phase field of the slot on the primary. The subscription cannot remotely alter the two_phase option of the slot while the subscription is enabled, as the slot is owned by the walsender on the publisher side.
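
To make issues 1 and 2 concrete, here is a toy Python model of the decode/apply flow. It is only a sketch under simplifying assumptions: it is not PostgreSQL code, and every class, method, and record layout in it (Subscriber, Publisher, decode, the (lsn, kind, gid, changes) tuple) is hypothetical. It only mimics the behaviour described above: PREPARE is forwarded when two_phase is on and the record is not before start_decoding_at, and COMMIT PREPARED is turned into a plain commit when two_phase is off.

```python
# Toy model of issues 1 and 2. Not PostgreSQL code; all names are hypothetical.

class Subscriber:
    def __init__(self):
        self.prepared = {}    # gid -> changes of pending prepared transactions
        self.committed = []   # changes of transactions that have been committed

    def prepare(self, gid, changes):
        self.prepared[gid] = changes

    def commit_prepared(self, gid):
        if gid not in self.prepared:
            raise LookupError(f"prepared transaction {gid!r} does not exist")
        self.committed.append(self.prepared.pop(gid))

    def commit(self, changes):
        self.committed.append(changes)


class Publisher:
    def __init__(self, subscriber, two_phase):
        self.sub = subscriber
        self.two_phase = two_phase
        self.start_decoding_at = 0

    def decode(self, lsn, kind, gid, changes):
        if kind == "PREPARE":
            # Forwarded only with two_phase on and at/after the start point.
            if self.two_phase and lsn >= self.start_decoding_at:
                self.sub.prepare(gid, changes)
        elif kind == "COMMIT PREPARED":
            if self.two_phase:
                self.sub.commit_prepared(gid)
            else:
                # two_phase off: the whole transaction is sent as a plain commit.
                self.sub.commit(changes)


def issue_1():
    """Toggle true -> false between PREPARE and COMMIT PREPARED."""
    sub = Subscriber()
    pub = Publisher(sub, two_phase=True)
    pub.decode(10, "PREPARE", "gid1", ["insert"])
    pub.two_phase = False
    pub.decode(20, "COMMIT PREPARED", "gid1", ["insert"])
    return sub  # committed once, but "gid1" is still pending in sub.prepared


def issue_2():
    """Toggle false -> true; decoding restarts past the PREPARE record."""
    sub = Subscriber()
    pub = Publisher(sub, two_phase=False)
    pub.decode(10, "PREPARE", "gid1", ["insert"])   # ignored, two_phase off
    pub.two_phase = True
    pub.start_decoding_at = 15                      # new start point after restart
    pub.decode(10, "PREPARE", "gid1", ["insert"])   # re-read, but skipped
    pub.decode(20, "COMMIT PREPARED", "gid1", ["insert"])  # raises LookupError
    return sub
```

In this model, issue_1() ends with the change committed and "gid1" still sitting in the subscriber's prepared set, and issue_2() fails with a LookupError, which stands in for the "gid not found" failure on the subscriber.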

Possible solutions for the 3 problems:

1. While toggling two_phase from true to false, we could probably get the list of prepared transactions for this subscriber id and roll back/abort them. This will allow the transactions to be re-applied like normal transactions when the commit comes. Alternatively, if doing this in the ALTER SUBSCRIPTION context isn't appropriate, we could store the xids of all prepared transactions of this subscription in a list, and when the corresponding xid is being committed by the apply worker, make sure the previously prepared transaction is rolled back prior to the commit. But this would add the overhead of checking this list every time a transaction is committed by the apply worker.

2. No solution yet.

3. We could mandate that the two_phase state only be altered after disabling the subscription, just as is done for the failover option. Let me know your thoughts.
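
The two variants of solution 1 could be sketched roughly as follows. Again this is a toy Python model and not the patch: ApplyWorker and its methods are hypothetical names, and transactions are keyed by xid purely for illustration. Variant A rolls back all pending prepared transactions at the time two_phase is switched off; variant B only records their xids and rolls each one back just before the matching commit, which is where the per-commit lookup overhead comes from.

```python
# Toy sketch of the two variants of solution 1. Not PostgreSQL code;
# all names are hypothetical.

class ApplyWorker:
    def __init__(self):
        self.prepared = {}        # xid -> changes prepared while two_phase was true
        self.committed = []       # changes of committed transactions
        self.stale_xids = set()   # variant B: prepared xids to roll back lazily

    def prepare(self, xid, changes):
        self.prepared[xid] = changes

    def rollback_prepared(self, xid):
        self.prepared.pop(xid, None)

    def disable_two_phase_eager(self):
        # Variant A: roll back every pending prepared transaction right away.
        for xid in list(self.prepared):
            self.rollback_prepared(xid)

    def disable_two_phase_lazy(self):
        # Variant B: only remember which xids are currently prepared.
        self.stale_xids = set(self.prepared)

    def commit(self, xid, changes):
        # Variant B cost: this membership check runs on every commit.
        if xid in self.stale_xids:
            self.rollback_prepared(xid)
            self.stale_xids.discard(xid)
        self.committed.append(changes)


# Variant A: nothing is left prepared once two_phase is switched off.
a = ApplyWorker()
a.prepare(700, ["insert"])
a.disable_two_phase_eager()
a.commit(700, ["insert"])

# Variant B: the prepared transaction survives until its commit arrives.
b = ApplyWorker()
b.prepare(701, ["update"])
b.disable_two_phase_lazy()
still_pending = 701 in b.prepared
b.commit(701, ["update"])
```

Either way the transaction ends up applied exactly once as a normal commit; the difference is only when the rollback of the old prepared transaction happens.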

regards,
Ajin Cherian
Fujitsu Australia
