Re: Synchronizing slots from primary to standby
От | shveta malik |
---|---|
Тема | Re: Synchronizing slots from primary to standby |
Дата | |
Msg-id | CAJpy0uDJ8rV1LWgPOXwcV9Xb1ESyB_0RJ6Ksbkc4FW4JgXXUcg@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Synchronizing slots from primary to standby (shveta malik <shveta.malik@gmail.com>) |
Ответы |
Re: Synchronizing slots from primary to standby
(shveta malik <shveta.malik@gmail.com>)
|
Список | pgsql-hackers |
On Thu, Aug 17, 2023 at 11:55 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Aug 17, 2023 at 11:44 AM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > On 8/14/23 11:52 AM, shveta malik wrote: > > > > > > > > We (myself and Ajin) performed the tests to compute the lag in standby > > > slots as compared to primary slots with different number of slot-sync > > > workers configured. > > > > > > > Thanks! > > > > > 3 DBs were created, each with 30 tables and each table having one > > > logical-pub/sub configured. So this made a total of 90 logical > > > replication slots to be synced. Then the workload was run for aprox 10 > > > mins. During this workload, at regular intervals, primary and standby > > > slots' lsns were captured (from pg_replication_slots) and compared. At > > > each capture, the intent was to know how much is each standby's slot > > > lagging behind corresponding primary's slot by taking the distance > > > between confirmed_flush_lsn of primary and standby slot. Then we took > > > the average (integer value) of this distance over the span of 10 min > > > workload > > > > Thanks for the explanations, make sense to me. > > > > > and this is what we got: > > > > > > With max_slot_sync_workers=1, average-lag = 42290.3563 > > > With max_slot_sync_workers=2, average-lag = 24585.1421 > > > With max_slot_sync_workers=3, average-lag = 14964.9215 > > > > > > This shows that more workers have better chances to keep logical > > > replication slots in sync for this case. > > > > > > > Agree. > > > > > Another statistics if it interests you is, we ran a frequency test as > > > well (this by changing code, unit test sort of) to figure out the > > > 'total number of times synchronization done' with different number of > > > sync-slots workers configured. Same 3 DBs setup with each DB having 30 > > > logical replication slots. With 'max_slot_sync_workers' set at 1, 2 > > > and 3; total number of times synchronization done was 15874, 20205 and > > > 23414 respectively. Note: this is not on the same machine where we > > > captured lsn-gap data, it is on a little less efficient machine but > > > gives almost the same picture > > > > > > Next we are planning to capture this data for a lesser number of slots > > > like 10,30,50 etc. It may happen that the benefit of multi-workers > > > over single workers in such cases could be less, but let's have the > > > data to verify that. > > > > > > > Thanks a lot for those numbers and for the testing! > > > > Do you think it would make sense to also get the number of using > > the pg_failover_slots module? (and compare the pg_failover_slots numbers with the > > "one worker" case here). Idea is to check if the patch does introduce > > some overhead as compare to pg_failover_slots. > > > > Yes, definitely. We will work on that and share the numbers soon. > We are working on these tests. Meanwhile attaching the patches which attempt to implement below functionalities: 1) Remove extra slots on standby if those no longer exist on primary or are no longer part of synchronize_slot_names. 2) Make synchronize_slot_names user-modifiable. And due to change in 'synchronize_slot_names', if dbids list is reduced, then take care of removal of extra/old db-ids (if any) from workers db-list. Thanks Ajin for working on 1. Both the above changes are in patch-0002. There is a test failure in the recovery module due to these new changes, I am looking into it and will fix it in the next version. Improvements in pipeline: a) standby slots should not be consumable. b) optimization of query which standby sends to primary. Currently it has dbid filter and slot-name filter. Slot-name filter can be optimized to have only those slots which belong to DBs assigned to the worker rather than having all 'synchronize_slot_names'. c) analyze if the naptime of the slot-sync worker can be auto-tuned. If there is no activity going on (i.e. slots are not advancing on primary) then increase naptime of slot-sync worker on standby and decrease it again when activity starts. thanks Shveta
Вложения
В списке pgsql-hackers по дате отправления: