Re: speed up a logical replica setup

Поиск
Список
Период
Сортировка
От Euler Taveira
Тема Re: speed up a logical replica setup
Дата
Msg-id c8e92bcb-69c3-477e-93d4-6f39e030cab0@app.fastmail.com
обсуждение исходный текст
Ответ на RE: speed up a logical replica setup  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Ответы Re: speed up a logical replica setup  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Re: speed up a logical replica setup  (Amit Kapila <amit.kapila16@gmail.com>)
Список pgsql-hackers
On Mon, Mar 25, 2024, at 1:06 PM, Hayato Kuroda (Fujitsu) wrote:
## Analysis for failure 1

The failure caused by a time lag between walreceiver finishes and pg_is_in_recovery()
returns true.

According to the output [1], it seems that the tool failed at wait_for_end_recovery()
with the message "standby server disconnected from the primary". Also, lines
"redo done at..." and "terminating walreceiver process due to administrator command"
meant that walreceiver was requested to shut down by XLogShutdownWalRcv().

According to the source, we confirm that walreceiver is shut down in
StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). Also, SharedRecoveryState
is changed to RECOVERY_STATE_DONE (this meant the pg_is_in_recovery() return true)
at the latter part of StartupXLOG().

So, if there is a delay between FinishWalRecovery() and change the state, the check
in wait_for_end_recovery() would be failed during the time. Since we allow to miss
the walreceiver 10 times and it is checked once per second, the failure occurs if
the time lag is longer than 10 seconds.

I do not have a good way to fix it. One approach is make NUM_CONN_ATTEMPTS larger,
but it's not a fundamental solution.

I was expecting that slow hosts might have issues in wait_for_end_recovery().
As you said it took a lot of steps between FinishWalRecovery() (where
walreceiver is shutdown -- XLogShutdownWalRcv) and SharedRecoveryState is set to
RECOVERY_STATE_DONE. If this window takes longer than NUM_CONN_ATTEMPTS *
WAIT_INTERVAL (10 seconds), it aborts the execution. That's a bad decision
because it already finished the promotion and it is just doing the final
preparation for the host to become a primary.

        /*   
         * If it is still in recovery, make sure the target server is
         * connected to the primary so it can receive the required WAL to
         * finish the recovery process. If it is disconnected try
         * NUM_CONN_ATTEMPTS in a row and bail out if not succeed.
         */
        res = PQexec(conn,
                     "SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");
        if (PQntuples(res) == 0)
        {
            if (++count > NUM_CONN_ATTEMPTS)
            {
                stop_standby_server(subscriber_dir);
                pg_log_error("standby server disconnected from the primary");
                break;
            }    
        }
        else
            count = 0;          /* reset counter if it connects again */

This code was add to defend against the death/crash of the target server. There
are at least 3 options:

(1) increase NUM_CONN_ATTEMPTS * WAIT_INTERVAL seconds. We discussed this constant
and I decided to use 10 seconds because even in some slow hosts, this time
wasn't reached during my tests. It seems I forgot to test the combination of slow
host, asserts enabled, and ubsan. I didn't notice that pg_promote() uses 60
seconds as default wait. Maybe that's a reasonable value. I checked the
004_timeline_switch test and the last run took: 39.2s (serinus), 33.1s
(culicidae), 18.31s (calliphoridae) and 27.52s (olingo).

(2) check if the primary is not running when walreceiver is not available on the
target server. Increase the connection attempts iif the primary is not running.
Hence, the described case doesn't cause an increment on the count variable.

(3) set recovery_timeout default to != 0 and remove pg_stat_wal_receiver check
protection against the death/crash target server. I explained in a previous
message that timeout may occur in cases that WAL replay to reach consistent
state takes more than recovery-timeout seconds.

Option (1) is the easiest fix, however, we can have the same issue again if a
slow host decides to be even slower, hence, we have to adjust this value again.
Option (2) interprets the walreceiver absence as a recovery end and if the
primary server is running it can indicate that the target server is in the
imminence of the recovery end. Option (3) is not as resilient as the other
options.

The first patch implements a combination of (1) and (2).

## Analysis for failure 2

According to [2], the physical replication slot which is specified as primary_slot_name
was not used by the walsender process. At that time walsender has not existed.

```
...
pg_createsubscriber: publisher: current wal senders: 0
pg_createsubscriber: command is: SELECT 1 FROM pg_catalog.pg_replication_slots WHERE active AND slot_name = 'physical_slot'
pg_createsubscriber: error: could not obtain replication slot information: got 0 rows, expected 1 row
...
```

Currently standby must be stopped before the command and current code does not
block the flow to ensure the replication is started. So there is a possibility
that the checking is run before walsender is launched.

One possible approach is to wait until the replication starts. Alternative one is
to ease the condition.

That's my suggestion too. I reused NUM_CONN_ATTEMPTS (that was renamed to
NUM_ATTEMPTS in the first patch). See second patch.


--
Euler Taveira

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: RFC: Logging plan of the running query
Следующее
От: "Euler Taveira"
Дата:
Сообщение: Re: speed up a logical replica setup