Re: speed up a logical replica setup

Поиск

Список

Период

Сортировка

От	Euler Taveira
Тема	Re: speed up a logical replica setup
Дата	26 марта 05:53:21
Msg-id	c8e92bcb-69c3-477e-93d4-6f39e030cab0@app.fastmail.com обсуждение исходный текст
Ответ на	RE: speed up a logical replica setup ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Ответы	Re: speed up a logical replica setup (Tomas Vondra <tomas.vondra@enterprisedb.com>) Re: speed up a logical replica setup (Amit Kapila <amit.kapila16@gmail.com>)
Список	pgsql-hackers

Дерево обсуждения

On Mon, Mar 25, 2024, at 1:06 PM, Hayato Kuroda (Fujitsu) wrote:

## Analysis for failure 1

The failure caused by a time lag between walreceiver finishes and pg_is_in_recovery()
returns true.

According to the output [1], it seems that the tool failed at wait_for_end_recovery()
with the message "standby server disconnected from the primary". Also, lines
"redo done at..." and "terminating walreceiver process due to administrator command"
meant that walreceiver was requested to shut down by XLogShutdownWalRcv().

According to the source, we confirm that walreceiver is shut down in
StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). Also, SharedRecoveryState
is changed to RECOVERY_STATE_DONE (this meant the pg_is_in_recovery() return true)
at the latter part of StartupXLOG().

So, if there is a delay between FinishWalRecovery() and change the state, the check
in wait_for_end_recovery() would be failed during the time. Since we allow to miss
the walreceiver 10 times and it is checked once per second, the failure occurs if
the time lag is longer than 10 seconds.

I do not have a good way to fix it. One approach is make NUM_CONN_ATTEMPTS larger,
but it's not a fundamental solution.

I was expecting that slow hosts might have issues in wait_for_end_recovery().

As you said it took a lot of steps between FinishWalRecovery() (where

walreceiver is shutdown -- XLogShutdownWalRcv) and SharedRecoveryState is set to

RECOVERY_STATE_DONE. If this window takes longer than NUM_CONN_ATTEMPTS *

WAIT_INTERVAL (10 seconds), it aborts the execution. That's a bad decision

because it already finished the promotion and it is just doing the final

preparation for the host to become a primary.

* If it is still in recovery, make sure the target server is

* connected to the primary so it can receive the required WAL to

* finish the recovery process. If it is disconnected try

* NUM_CONN_ATTEMPTS in a row and bail out if not succeed.

res = PQexec(conn,

"SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");

if (PQntuples(res) == 0)

{

if (++count > NUM_CONN_ATTEMPTS)

{

stop_standby_server(subscriber_dir);

pg_log_error("standby server disconnected from the primary");

break;

}

else

count = 0; /* reset counter if it connects again */

This code was add to defend against the death/crash of the target server. There

are at least 3 options:

(1) increase NUM_CONN_ATTEMPTS * WAIT_INTERVAL seconds. We discussed this constant

and I decided to use 10 seconds because even in some slow hosts, this time

wasn't reached during my tests. It seems I forgot to test the combination of slow

host, asserts enabled, and ubsan. I didn't notice that pg_promote() uses 60

seconds as default wait. Maybe that's a reasonable value. I checked the

004_timeline_switch test and the last run took: 39.2s (serinus), 33.1s

(culicidae), 18.31s (calliphoridae) and 27.52s (olingo).

(2) check if the primary is not running when walreceiver is not available on the

target server. Increase the connection attempts iif the primary is not running.

Hence, the described case doesn't cause an increment on the count variable.

(3) set recovery_timeout default to != 0 and remove pg_stat_wal_receiver check

protection against the death/crash target server. I explained in a previous

message that timeout may occur in cases that WAL replay to reach consistent

state takes more than recovery-timeout seconds.

Option (1) is the easiest fix, however, we can have the same issue again if a

slow host decides to be even slower, hence, we have to adjust this value again.

Option (2) interprets the walreceiver absence as a recovery end and if the

primary server is running it can indicate that the target server is in the

imminence of the recovery end. Option (3) is not as resilient as the other

options.

The first patch implements a combination of (1) and (2).

## Analysis for failure 2

According to [2], the physical replication slot which is specified as primary_slot_name
was not used by the walsender process. At that time walsender has not existed.

```
...
pg_createsubscriber: publisher: current wal senders: 0
pg_createsubscriber: command is: SELECT 1 FROM pg_catalog.pg_replication_slots WHERE active AND slot_name = 'physical_slot'
pg_createsubscriber: error: could not obtain replication slot information: got 0 rows, expected 1 row
...
```

Currently standby must be stopped before the command and current code does not
block the flow to ensure the replication is started. So there is a possibility
that the checking is run before walsender is launched.

One possible approach is to wait until the replication starts. Alternative one is
to ease the condition.

That's my suggestion too. I reused NUM_CONN_ATTEMPTS (that was renamed to

NUM_ATTEMPTS in the first patch). See second patch.

Euler Taveira

EDB https://www.enterprisedb.com/

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Andres Freund
Дата: 26 марта, 05:34:47
Сообщение: Re: RFC: Logging plan of the running query

Следующее

От: "Euler Taveira"
Дата: 26 марта, 05:57:27
Сообщение: Re: speed up a logical replica setup

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: speed up a logical replica setup

Вложения

Предыдущее

Следующее