Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

Поиск
Список
Период
Сортировка
От Melih Mutlu
Тема Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Дата
Msg-id CAGPVpCTKqBZxkAj4rTqwzdH034fUfwfUF1fHTTsNrt4+gkJJZQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication  (Amit Kapila <amit.kapila16@gmail.com>)
Ответы Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication  (Amit Kapila <amit.kapila16@gmail.com>)
Список pgsql-hackers
Hi Amit,

I updated the patch in order to prevent the problems that might be caused by using different replication slots for syncing a table.
As suggested in previous emails, replication slot names are stored in the catalog. So slot names can be reached later and it is ensured 
that same replication slot is used during tablesync step of a table.

With the current version of the patch:
-. "srrelslotname" column is introduced into pg_subscibtion_rel catalog. It stores the slot name for tablesync

-. Tablesync worker logic is now as follows:
1. Tablesync worker is launched by apply worker for a table.
2. Worker generates a default replication slot name for itself. Slot name includes subid and worker pid for tracking purposes.
3. If table has a slot name value in the catalog: 
i. If the table state is DATASYNC, drop the replication slot from the catalog and proceed tablesync with a new slot.
ii. If the table state is FINISHEDCOPY, use the replicaton slot from the catalog, do not create a new slot.
4. Before worker moves to new table, drop any replication slot that are retrieved from the catalog and used.
5. In case of no table left to sync, drop the replication slot of that sync worker with worker pid if it exists. (It's possible that a sync worker do not create a replication slot for itself but uses slots read from the catalog in each iteration) 
 

I think using worker_pid also has similar risks of mixing slots from
different workers because after restart same worker_pid could be
assigned to a totally different worker. Can we think of using a unique
64-bit number instead? This will be allocated when each workers
started for the very first time and after that we can refer catalog to
find it as suggested in the idea below.

I'm not sure how likely to have colliding pid's for different tablesync workers in the same subscription.
Though ,having pid in slot name makes it easier to track which slot belongs to which worker. That's why I kept using pid in slot names.
But I think it should be simple to switch to using a unique 64-bit number. So I can remove pid's from slot names, if you think that it would be better. 
 
We should use the same for the origin name as well.

I did not really change anything related to origin names. Origin names are still the same and include relation id. What do you think would be an issue with origin names in this patch? 
 
This sounds tricky. Why not first drop slot/origin and then detach it
from pg_subscription_rel? On restarts, it is possible that we may
error out after dropping the slot or origin but before updating the
catalog entry but in such case we can ignore missing slot/origin and
detach them from pg_subscription_rel. Also, if we use the unique
number as suggested above, I think even if we don't remove it after
the relation state is ready, it should be okay.

Right, I did not add an additional slot cleanup step. The patch now drops the slot when we're done with it and then removes it from the catalog. 


Thanks,
Melih 
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: "shiy.fnst@fujitsu.com"
Дата:
Сообщение: RE: Introduce wait_for_subscription_sync for TAP tests
Следующее
От: Michael Paquier
Дата:
Сообщение: Re: Refactor to make use of a common function for GetSubscriptionRelations and GetSubscriptionNotReadyRelations.