Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

Поиск
Список
Период
Сортировка
От Melih Mutlu
Тема Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Дата
Msg-id CAGPVpCQ8m0Jcy2gip=qLis7XPyO2bkb1ZMUBWSXQcyv3tEtz2Q@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication  (Melih Mutlu <m.melihmutlu@gmail.com>)
Список pgsql-hackers
Hi,

Melih Mutlu <m.melihmutlu@gmail.com>, 16 Şub 2023 Per, 14:37 tarihinde şunu yazdı:
I see that setting originname in the catalog before actually creating it causes issues. My concern with setting originname when setting the state to FINISHEDCOPY is that if worker waits until FINISHEDCOPY to update slot/origin name but it fails somewhere before reaching FINISHEDCOPY and after creating slot/origin, then this new created slot/origin will be left behind. It wouldn't be possible to find and drop them since their names are not stored in the catalog. Eventually, this might also cause hitting the max_replication_slots limit in case of such failures between origin creation and FINISHEDCOPY.

One fix I can think is to update the catalog right after creating a new origin. But this would also require commiting the current transaction to actually persist the originname. I guess this action of commiting the transaction in the middle of initial sync could hurt the copy process.

Here are more thoughts on this:
I still believe that updating originname when setting state to FINISHEDCOPY is not a good idea since any failure before FINISHEDCOPY prevent us to store originname in the catalog. If an origin or slot is not in the catalog, it's not easily possible to find and drop origins/slot that are left behind. And we definitely do not want to keep unnecessary origins/slots since we would hit max_replication_slots limit. 
It's better to be safe and update origin/slot names when setting state to DATASYNC. At this point, the worker must be sure that it writes correct origin/slot names into the catalog. 
Following part actually cleans up the catalog if a table is left behind in DATASYNC state and its slot and origin cannot be used for sync.

ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, prev_slotname, true);

StartTransactionCommand();
/* Replication origin might still exist. Try to drop */
replorigin_drop_by_name(originname, true, false);

/*
* Remove replication slot and origin name from the relation's
* catalog record
*/
UpdateSubscriptionRel(MyLogicalRepWorker->subid,
 MyLogicalRepWorker->relid,
 MyLogicalRepWorker->relstate,
 MyLogicalRepWorker->relstate_lsn,
 NULL,
 NULL);

The patch needs to refresh the origin name before it begins copying the table. It will try to read from the catalog but won't find any slot/origin since they are cleaned. Then, it will move on with the correct origin name which is the one created/will be created for the current sync worker.

I tested refetching originname and seems like it fixes the errors you reported.

Thanks,
--
Melih Mutlu
Microsoft

В списке pgsql-hackers по дате отправления:

Предыдущее
От: "wangw.fnst@fujitsu.com"
Дата:
Сообщение: RE: Fix the description of GUC "max_locks_per_transaction" and "max_pred_locks_per_transaction" in guc_table.c
Следующее
От: Melih Mutlu
Дата:
Сообщение: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication