Re: wake up logical workers after ALTER SUBSCRIPTION
| От | Nathan Bossart | 
|---|---|
| Тема | Re: wake up logical workers after ALTER SUBSCRIPTION | 
| Дата | |
| Msg-id | 20230104181219.GB296349@nathanxps13 обсуждение исходный текст | 
| Ответ на | Re: wake up logical workers after ALTER SUBSCRIPTION (Amit Kapila <amit.kapila16@gmail.com>) | 
| Ответы | Re: wake up logical workers after ALTER SUBSCRIPTION | 
| Список | pgsql-hackers | 
On Wed, Jan 04, 2023 at 10:57:43AM +0530, Amit Kapila wrote: > On Tue, Jan 3, 2023 at 11:40 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >> My approach was to add a variable to LogicalRepWorker that indicated >> whether a worker needed to be restarted immediately. While this is a >> little weird because the workers array is treated as slots, it worked >> nicely for ALTER SUBSCRIPTION. > > So, are you planning to keep its in_use and subid flag as it is in > logicalrep_worker_cleanup()? Otherwise, without that it could be > reused for some other subscription. I believe I did something like this in my proof-of-concept. I might have used the new flag as another indicator that the slot was still "in use". In any case, you are right that we need to prevent the slot from being reused. > What if we maintain a hash table similar to 'last_start_times' > maintained in tablesync.c? It won't have entries for new > subscriptions, so for those we may not need to wait till > wal_retrieve_retry_interval. I proposed this upthread [0]. I still think it is a worthwhile change. Right now, if a worker needs to be restarted but another unrelated worker was restarted less than wal_retrieve_retry_interval milliseconds ago, the launcher waits to restart it. I think it makes more sense for each worker to have its own restart interval tracked. >> IIUC you are suggesting just one variable that would bypass >> wal_retrieve_retry_interval for all subscriptions, not just those newly >> altered or created. This definitely seems like it would prevent delays, >> but it would also cause wal_retrieve_retry_interval to be incorrectly >> bypassed for the other workers in some cases. > > Right, but I guess it would be rare in practical cases that someone > Altered/Created a subscription, and also some workers are restarted > due to errors/crashes as only in those cases launcher can restart the > worker when it shouldn't. However, in that case, also, it won't > restart the apply worker again and again unless there are concurrent > Create/Alter Subscription operations going on. IIUC, currently also it > can always first time restart the worker immediately after ERROR/CRASH > because we don't maintain last_start_time for each worker. I think > this is probably okay as we want to avoid repeated restarts after the > ERROR. This line of thinking is why I felt that lowering wal_retrieve_retry_interval for the tests might be sufficient. Besides the fact that it revealed multiple bugs, I don't see the point in adding much more complexity here. In practice, workers will usually start right away, unless of course there are other worker starts happening around the same time. This consistently causes testing delays because the tests stress these code paths, but I don't think what the tests are doing is a typical use-case. From the discussion thus far, it sounds like the alternatives are to 1) add a global flag that causes wal_retrieve_retry_interval to be bypassed for all workers or to 2) add a hash map in the launcher and a restart_immediately flag in each worker slot. I'll go ahead and create a patch for 2 since it seems like the most complete solution, and we can evaluate whether the complexity seems appropriate. [0] https://postgr.es/m/20221214171023.GA689106%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
В списке pgsql-hackers по дате отправления: