Re: Single transaction in the tablesync worker?

Поиск
Список
Период
Сортировка
От Peter Smith
Тема Re: Single transaction in the tablesync worker?
Дата
Msg-id CAHut+Pv2dWz+SkOUm-ERzOoY1PMRv0mejsT98O6oNCraKVH6xg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Single transaction in the tablesync worker?  (Amit Kapila <amit.kapila16@gmail.com>)
Список pgsql-hackers
On Sat, Dec 19, 2020 at 5:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Dec 18, 2020 at 6:41 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > TODO / Known Issues:
> >
> > * the current implementation of tablesync drop slot (e.g. from
> > DropSubscription or finish_sync_worker) regenerates the tablesync slot
> > name so it knows what slot to drop.
> >
>
> If you always drop the slot at finish_sync_worker, then in which case
> do you need to drop it during DropSubscription? Is it when the table
> sync workers are crashed?

Yes. It is not the normal case. But if the tablesync never yet got to
SYNCDONE state (maybe crashed) then finish_sync_worker may not be
called.
So I think a rogue tablesync slot might still exist during DropSubscription.

>
> > The current code might be ok for
> > normal use cases, but if there is an ALTER SUBSCRIPTION ... SET
> > (slot_name = newname) it would fail to be able to find the tablesync
> > slot.
> >
>
> Sure, but the same will be true for the apply worker slot as well. I
> agree the problem would be more for table sync workers but I think we
> can solve it, see below.
>
> > * I think if there are crashed tablesync workers then they are not
> > known to DropSubscription. So this might be a problem to cleanup slots
> > and/or origin tracking belonging to those unknown workers.
> >
>
> Yeah, I think we can do two things to avoid this and the previous
> problem. (a) We can generate the slot_name for the table sync worker
> based on only subscription_id and rel_id. (b) Immediately after
> creating the slot, advance the replication origin with the position
> (origin_startpos) we get from walrcv_create_slot, this will help us to
> start from the right location.
>
> Do you see anything which will still not be addressed after doing the above?

(a) V5 Patch is updated as suggested.
(b) V5 Patch is updated as suggested. Now calling replorigin_advance.
No problems seen so far. All TAP tests pass, but more testing needed
for the origin stuff

>
> I understand why you are trying to create this patch atop logical
> decoding of 2PC patch but I think it is better to create this as an
> independent patch and then use it to test 2PC problem.

OK. The latest patch still applies to v30 just for my convenience
today, but I will head towards converting this to an independent patch
ASAP.

> Also, please
> explain what kind of testing you did to ensure that it works properly
> after the table sync worker restarts after the crash.

So far tested like this - I caused the tablesync to crash after
COPYDONE (but before SYNCDONE) by sending a row to cause a PK
violation while holding the tablesync at the CATCHUP state in the
debugger. The tablesync then handles the insert, encounters the PK
violation error, and re-launches. Then I can remove the extra row so
the PK violation does not happen, so the (re-launched) tablesync can
complete and finish normally. The Apply worker then takes over.

I have attached some captured/annotated logging of my test scenario
which I ran using the V4 patch (the log has a lot of extra temporary
output to help see what is going on)

---
Kind Regards,
Peter Smith.
Fujitsu Australia.

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Smith
Дата:
Сообщение: Re: Single transaction in the tablesync worker?
Следующее
От: Noah Misch
Дата:
Сообщение: Re: [PATCH] Logical decoding of TRUNCATE