Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: Konstantin Knizhnik
Subject: Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
Date:
Msg-id: 71f3e6fb-2fca-a798-856a-f23c8ede2333@garret.ru
In reply to: Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes  ("Bossart, Nathan" <bossartn@amazon.com>)
Responses: Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List: pgsql-hackers

On 11.01.2022 03:06, Bossart, Nathan wrote:
> I noticed this thread and thought I'd share my experiences building
> something similar for Multi-AZ DB clusters [0].  It's not a strict RPO
> mechanism, but it does throttle backends in an effort to keep the
> replay lag below a configured maximum.  I can share the code if there
> is interest.
>
> I wrote it as a new extension, and except for one piece that I'll go
> into later, I was able to avoid changes to core PostgreSQL code.  The
> extension manages a background worker that periodically assesses the
> state of the designated standbys and updates an atomic in shared
> memory that indicates how long to delay.  A transaction callback
> checks this value and sleeps as necessary.  Delay can be injected for
> write-enabled transactions on the primary, read-only transactions on
> the standbys, or both.  The extension is heavily configurable so that
> it can meet the needs of a variety of workloads.
>
> One interesting challenge I encountered was accurately determining the
> amount of replay lag.  The problem was twofold.  First, if there is no
> activity on the primary, there will be nothing to replay on the
> standbys, so the replay lag will appear to grow unbounded.  To work
> around this, the extension's background worker periodically creates an
> empty COMMIT record.  Second, if a standby reconnects after a long
> time, the replay lag won't be accurate for some time.  Instead, the
> replay lag will slowly increase until it reaches the correct value.
> Since the delay calculation looks at the trend of the replay lag, this
> apparent unbounded growth causes it to inject far more delay than is
> necessary.  My guess is that this is related to 9ea3c64, and maybe it
> is worth rethinking that logic.  For now, the extension just
> periodically reports the value of GetLatestXTime() from the standbys
> to the primary to get an accurate reading.  This is done via a new
> replication callback mechanism (which requires core PostgreSQL
> changes).  I can share this patch along with the extension, as I bet
> there are other applications for it.
>
> I should also note that the extension only considers "active" standbys
> and primaries.  That is, ones with an active WAL sender or WAL
> receiver.  This avoids the need to guess what should be done during a
> network partition, but it also means that we must gracefully handle
> standbys reconnecting with massive amounts of lag.  The extension is
> designed to slowly ramp up the amount of injected delay until the
> standby's apply lag is trending down at a sufficient rate.
>
> I see that an approach was suggested upthread for throttling based on
> WAL distance instead of per-transaction.  While the transaction
> approach works decently well for certain workloads (e.g., many small
> transactions like those from pgbench), it might require further tuning
> for very large transactions or workloads with a variety of transaction
> sizes.  For that reason, I would definitely support building a way to
> throttle based on WAL generation.  It might be a good idea to avoid
> throttling critical activity such as anti-wraparound vacuuming, too.
>
> Nathan
>
> [0] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/multi-az-db-clusters-concepts.html
>

We have faced a similar problem in Zenith (an open-source Aurora) and had to 
implement a back pressure mechanism to prevent overflow of WAL at stateless 
compute nodes and too-long delays of page reconstruction. Our implementation 
is the following:
1. Three GUCs are added: max_replication_write/flush/apply_lag
2. Replication lags are checked in XLogInsert and if one of the three 
thresholds is exceeded then InterruptPending is set.
3. In ProcessInterrupts we block backend execution until the lag is within 
the specified boundary:

     #define BACK_PRESSURE_DELAY 10000L // 0.01 sec
     int64 lag;

     while (true)
     {
         ProcessInterrupts_pg();

         // Suspend writers until replicas catch up
         lag = backpressure_lag();
         if (lag <= 0)
             break;

         set_ps_display("backpressure throttling");

         elog(DEBUG2, "backpressure throttling: lag " INT64_FORMAT, lag);
         pg_usleep(BACK_PRESSURE_DELAY);
     }

What is wrong here is that a backend can be blocked for a long time (causing 
failure of the client application due to timeout expiration) and holds its 
acquired locks while sleeping.
We are thinking about a smarter way of choosing the throttling delay (for 
example, exponentially increasing the sleep interval until some maximal 
value is reached).
But it is really hard to find a universal scheme that is good for all use 
cases (for example, short-lived sessions whose clients connect to the server 
just to execute one query).

Concerning throttling at the end of a transaction, which eliminates the 
problem of holding locks and does not require changes in the postgres core: 
unfortunately it doesn't address the problem of large transactions (for 
example, bulk loading of data using COPY). In this case just one transaction 
can cause an arbitrarily large lag.

I am not sure how critical the problem of holding locks during throttling 
is: yes, it may block other database activity, including vacuum and 
execution of read-only queries.
But it should not block the walsender and so cause a deadlock. And in most 
cases read-only transactions do not conflict with write transactions, so 
suspending a write transaction should not block readers.

Another problem with throttling is large WAL records (for example, a custom 
logical replication WAL record can be arbitrarily large). If such a record 
is larger than the replication lag limit, then it can cause a deadlock.


