Re: Single transaction in the tablesync worker?
From | Peter Smith
---|---
Subject | Re: Single transaction in the tablesync worker?
Date | |
Msg-id | CAHut+Pv2dWz+SkOUm-ERzOoY1PMRv0mejsT98O6oNCraKVH6xg@mail.gmail.com
In reply to | Re: Single transaction in the tablesync worker? (Amit Kapila <amit.kapila16@gmail.com>)
List | pgsql-hackers
On Sat, Dec 19, 2020 at 5:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Dec 18, 2020 at 6:41 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > TODO / Known Issues:
> >
> > * the current implementation of tablesync drop slot (e.g. from
> > DropSubscription or finish_sync_worker) regenerates the tablesync slot
> > name so it knows what slot to drop.
> >
>
> If you always drop the slot at finish_sync_worker, then in which case
> do you need to drop it during DropSubscription? Is it when the table
> sync workers are crashed?

Yes. It is not the normal case. But if the tablesync never yet got to
SYNCDONE state (maybe crashed), then finish_sync_worker may not be
called. So I think a rogue tablesync slot might still exist during
DropSubscription.

> > The current code might be ok for
> > normal use cases, but if there is an ALTER SUBSCRIPTION ... SET
> > (slot_name = newname) it would fail to be able to find the tablesync
> > slot.
> >
>
> Sure, but the same will be true for the apply worker slot as well. I
> agree the problem would be more for table sync workers but I think we
> can solve it, see below.
>
> > * I think if there are crashed tablesync workers then they are not
> > known to DropSubscription. So this might be a problem to cleanup slots
> > and/or origin tracking belonging to those unknown workers.
> >
>
> Yeah, I think we can do two things to avoid this and the previous
> problem. (a) We can generate the slot_name for the table sync worker
> based on only subscription_id and rel_id. (b) Immediately after
> creating the slot, advance the replication origin with the position
> (origin_startpos) we get from walrcv_create_slot, this will help us to
> start from the right location.
>
> Do you see anything which will still not be addressed after doing the above?

(a) V5 Patch is updated as suggested.

(b) V5 Patch is updated as suggested. Now calling replorigin_advance.
No problems seen so far.
All TAP tests pass, but more testing is needed for the origin stuff.

> I understand why you are trying to create this patch atop logical
> decoding of 2PC patch but I think it is better to create this as an
> independent patch and then use it to test 2PC problem.

OK. The latest patch still applies to v30 just for my convenience
today, but I will head towards converting this to an independent patch
ASAP.

> Also, please
> explain what kind of testing you did to ensure that it works properly
> after the table sync worker restarts after the crash.

So far I have tested like this: I caused the tablesync to crash after
COPYDONE (but before SYNCDONE) by sending a row that causes a PK
violation while holding the tablesync at the CATCHUP state in the
debugger. The tablesync then handles the insert, encounters the PK
violation error, and re-launches. Then I remove the extra row so the
PK violation does not happen, and the (re-launched) tablesync can
complete and finish normally. The apply worker then takes over.

I have attached some captured/annotated logging of my test scenario,
which I ran using the V4 patch (the log has a lot of extra temporary
output to help see what is going on).

---
Kind Regards,
Peter Smith.
Fujitsu Australia.
Attachments