RE: speed up a logical replica setup

Поиск
Список
Период
Сортировка
От Hayato Kuroda (Fujitsu)
Тема RE: speed up a logical replica setup
Дата
Msg-id TYCPR01MB12077E98F930C3DE6BD304D0DF54C2@TYCPR01MB12077.jpnprd01.prod.outlook.com
обсуждение исходный текст
Ответ на Re: speed up a logical replica setup  ("Euler Taveira" <euler@eulerto.com>)
Список pgsql-hackers
Dear Euler,

Thanks for updating the patch!
Before reviewing deeply, here are replies for your comments.

>
> > Points raised by me [1] are not solved yet.
> > 
> > * What if the target version is PG16-?
pg_ctl and pg_resetwal won't work.
$ pg_ctl start -D /tmp/blah
waiting for server to start....
2024-02-15 23:50:03.448 -03 [364610] FATAL:  database files are incompatible with server
2024-02-15 23:50:03.448 -03 [364610] DETAIL:  The data directory was initialized by PostgreSQL version 16, which is not
compatiblewith this version 17devel.
 
stopped waiting
pg_ctl: could not start server
Examine the log output.

$ pg_resetwal -D /tmp/blah
pg_resetwal: error: data directory is of wrong version
pg_resetwal: detail: File "PG_VERSION" contains "16", which is not compatible with this program's version "17".

> > * What if the found executables have diffent version with pg_createsubscriber?

The new code take care of it.
>

I preferred to have a common path and test one by one, but agreed this worked well.
Let's keep it and hear opinions from others.

>
> > * What if the target is sending WAL to another server?
> >    I.e., there are clusters like `node1->node2-.node3`, and the target is node2.
The new code detects if the server is in recovery and aborts as you suggested.
A new option can be added to ignore the fact there are servers receiving WAL
from it.
>

Confirmed it can detect.

>
> > * Can we really cleanup the standby in case of failure?
> >    Shouldn't we suggest to remove the target once?

If it finishes the promotion, no. I adjusted the cleanup routine a bit to avoid
it. However, we should provide instructions to inform the user that it should
create a fresh standby and try again.
>

I think the cleanup function looks not sufficient. In v21, recovery_ended is kept
to true even after drop_publication() is done in setup_subscriber(). I think that
made_subscription is not needed, and the function should output some messages
when recovery_ended is on.
Besides, in case of pg_upgrade, they always report below messages before starting
the migration. I think this is more helpful for users.

```
    pg_log(PG_REPORT, "\n"
           "If pg_upgrade fails after this point, you must re-initdb the\n"
           "new cluster before continuing.");
```

>
> > * Can we move outputs to stdout?

Are you suggesting to use another logging framework? It is not a good idea
because each client program is already using common/logging.c.
>

Hmm, indeed. Other programs in pg_basebackup seem to use the framework.

>
v20-0011: Do we really want to interrupt the recovery if the network was
momentarily interrupted or if the OS killed walsender? Recovery is critical for
the process. I think we should do our best to be resilient and deliver all
changes required by the new subscriber.
>

It might be too strict to raise an ERROR as soon as we met a disconnection.
And at least 0011 was wrong - it should be PQgetvalue(res, 0, 1) for still_alive.

>
The proposal is not correct because the
query return no tuples if it is disconnected so you cannot PQgetvalue().
>

Sorry for misunderstanding, but you might be confused. pg_createsubcriber
sends a query to target, and we are discussing the disconnection between the
target and source instances. I think the connection which pg_createsubscriber
has is stil alive so PQgetvalue() can get a value.

(BTW, callers of server_is_in_recovery() has not considered a disconnection from
the target...)

>
The
retry interval (the time that ServerLoop() will create another walreceiver) is
determined by DetermineSleepTime() and it is a maximum of 5 seconds
(SIGKILL_CHILDREN_AFTER_SECS). One idea is to retry 2 or 3 times before give up
using the pg_stat_wal_receiver query. Do you have a better plan?
>

It's good to determine the threshold. It can define the number  of acceptable
loss of walreceiver during the loop.
I think we should retry at least the postmaster revives the walreceiver.
The checking would work once per seconds, so more than 5 (or 10) may be better.
Thought?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Bertrand Drouvot
Дата:
Сообщение: Re: 035_standby_logical_decoding unbounded hang
Следующее
От: Bertrand Drouvot
Дата:
Сообщение: Re: Synchronizing slots from primary to standby