RE: speed up a logical replica setup
From | Hayato Kuroda (Fujitsu)
Subject | RE: speed up a logical replica setup
Date |
Msg-id | TYCPR01MB12077E98F930C3DE6BD304D0DF54C2@TYCPR01MB12077.jpnprd01.prod.outlook.com
In reply to | Re: speed up a logical replica setup ("Euler Taveira" <euler@eulerto.com>)
List | pgsql-hackers
Dear Euler,

Thanks for updating the patch! Before reviewing deeply, here are replies to your comments.

> > Points raised by me [1] are not solved yet.
> >
> > * What if the target version is PG16-? pg_ctl and pg_resetwal won't work.
>
> $ pg_ctl start -D /tmp/blah
> waiting for server to start....
> 2024-02-15 23:50:03.448 -03 [364610] FATAL:  database files are incompatible with server
> 2024-02-15 23:50:03.448 -03 [364610] DETAIL:  The data directory was initialized by PostgreSQL version 16, which is not compatible with this version 17devel.
>  stopped waiting
> pg_ctl: could not start server
> Examine the log output.
>
> $ pg_resetwal -D /tmp/blah
> pg_resetwal: error: data directory is of wrong version
> pg_resetwal: detail: File "PG_VERSION" contains "16", which is not compatible with this program's version "17".
>
> > * What if the found executables have a different version from pg_createsubscriber?
>
> The new code takes care of it.

I preferred to have a common path and test them one by one, but I agree this works well. Let's keep it and hear opinions from others.

> > * What if the target is sending WAL to another server?
> >   I.e., there are clusters like `node1->node2->node3`, and the target is node2.
>
> The new code detects if the server is in recovery and aborts as you suggested. A new
> option can be added to ignore the fact that there are servers receiving WAL from it.

Confirmed it can detect this.

> > * Can we really clean up the standby in case of failure?
> >   Shouldn't we suggest removing the target once?
>
> If it finishes the promotion, no. I adjusted the cleanup routine a bit to avoid it.
> However, we should provide instructions telling the user to create a fresh standby
> and try again.

I think the cleanup function does not look sufficient. In v21, recovery_ended is kept true even after drop_publication() is done in setup_subscriber(). I think made_subscription is not needed, and the function should output some messages when recovery_ended is set.
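To make the suggestion concrete, here is a rough sketch of the cleanup behavior I have in mind: once the standby has been promoted, do not attempt to drop anything; instead tell the user to rebuild the standby. All names below (CleanupState, cleanup_objects_atexit's return convention) are hypothetical, not the actual patch code:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical state flags, loosely modeled on the discussion above. */
struct CleanupState
{
	bool		recovery_ended;		/* the standby was promoted */
	bool		made_publication;	/* a publication was created on the source */
};

/*
 * Returns true if objects were (or could be) cleaned up, false if the
 * target is past the point of no return and must be recreated.
 */
static bool
cleanup_objects_atexit(const struct CleanupState *state)
{
	if (state->recovery_ended)
	{
		/* Point of no return: only report, never try to undo. */
		fprintf(stderr,
				"The target server has been promoted and cannot be reused.\n"
				"Create a fresh standby from the source server and try again.\n");
		return false;
	}

	if (state->made_publication)
	{
		/* drop_publication(...) would run here in the real tool */
	}
	return true;
}
```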
Besides, in the case of pg_upgrade, it always reports the message below before starting the migration. I think this is more helpful for users.

```
pg_log(PG_REPORT, "\n"
       "If pg_upgrade fails after this point, you must re-initdb the\n"
       "new cluster before continuing.");
```

> > * Can we move outputs to stdout?
>
> Are you suggesting using another logging framework? It is not a good idea because
> each client program is already using common/logging.c.

Hmm, indeed. Other programs in pg_basebackup seem to use the framework.

> > v20-0011: Do we really want to interrupt the recovery if the network was momentarily
> > interrupted or if the OS killed walsender? Recovery is critical for the process.
> > I think we should do our best to be resilient and deliver all changes required by
> > the new subscriber.
>
> It might be too strict to raise an ERROR as soon as we meet a disconnection. And at
> least 0011 was wrong - it should be PQgetvalue(res, 0, 1) for still_alive.

> The proposal is not correct because the query returns no tuples if it is disconnected,
> so you cannot use PQgetvalue().

Sorry for the misunderstanding, but you might be confused. pg_createsubscriber sends a query to the target, and we are discussing the disconnection between the target and source instances. I think the connection which pg_createsubscriber holds is still alive, so PQgetvalue() can get a value. (BTW, callers of server_is_in_recovery() have not considered a disconnection from the target...)

> The retry interval (the time until ServerLoop() creates another walreceiver) is
> determined by DetermineSleepTime() and it is a maximum of 5 seconds
> (SIGKILL_CHILDREN_AFTER_SECS). One idea is to retry 2 or 3 times before giving up on
> the pg_stat_wal_receiver query. Do you have a better plan?

It is good to determine a threshold; it defines the number of acceptable losses of the walreceiver during the loop. I think we should retry at least until the postmaster revives the walreceiver.
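As a purely illustrative sketch of that retry policy, tolerating up to a fixed number of consecutive checks in which no walreceiver is found before giving up (check_walreceiver() stands in for the real pg_stat_wal_receiver query; the names and the threshold are assumptions, not patch code):

```c
#include <stdbool.h>

/* Threshold above the ~5s interval at which the postmaster
 * (DetermineSleepTime) would start a replacement walreceiver. */
#define MAX_MISSING 10

/*
 * Poll until check_walreceiver() reports an active walreceiver, giving
 * up after MAX_MISSING consecutive failed checks.  Returns true if a
 * walreceiver was seen, false if the threshold was exceeded.
 */
static bool
wait_for_walreceiver(bool (*check_walreceiver)(void))
{
	int			missing = 0;

	while (missing < MAX_MISSING)
	{
		if (check_walreceiver())
			return true;		/* walreceiver is alive (again) */
		missing++;
		/* sleep(1);  the real loop would poll once per second */
	}
	return false;				/* give up: threshold exceeded */
}
```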
The checking would run once per second, so a threshold of more than 5 (or 10) may be better. Thoughts?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/