Re: Syncrep and improving latency due to WAL throttling

From: Andres Freund
Subject: Re: Syncrep and improving latency due to WAL throttling
Date:
Msg-id: 20231108064008.nh4wb5g5ucwffeil@awork3.anarazel.de
In reply to: Re: Syncrep and improving latency due to WAL throttling  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: Syncrep and improving latency due to WAL throttling  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
Hi,

On 2023-11-04 20:00:46 +0100, Tomas Vondra wrote:
> scope
> -----
> Now, let's talk about scope - what the patch does not aim to do. The
> patch is explicitly intended for syncrep clusters, not async. There have
> been proposals to also support throttling for async replicas, logical
> replication etc. I suppose all of that could be implemented, and I do
> see the benefit of defining some sort of maximum lag even for async
> replicas. But the agreement was to focus on the syncrep case, where it's
> particularly painful, and perhaps extend it in the future.

Perhaps we should take care to make the configuration extensible in that
direction in the future?


Hm - is this feature really tied to replication, at all? Pretty much the same
situation exists without it. On an ok-ish local nvme I ran pgbench with 1 client
and -P1. Guess where I started a VACUUM (on a fully cached table, so no
continuous WAL flushes):

progress: 64.0 s, 634.0 tps, lat 1.578 ms stddev 0.477, 0 failed
progress: 65.0 s, 634.0 tps, lat 1.577 ms stddev 0.546, 0 failed
progress: 66.0 s, 639.0 tps, lat 1.566 ms stddev 0.656, 0 failed
progress: 67.0 s, 642.0 tps, lat 1.557 ms stddev 0.273, 0 failed
progress: 68.0 s, 556.0 tps, lat 1.793 ms stddev 0.690, 0 failed
progress: 69.0 s, 281.0 tps, lat 3.568 ms stddev 1.050, 0 failed
progress: 70.0 s, 282.0 tps, lat 3.539 ms stddev 1.072, 0 failed
progress: 71.0 s, 273.0 tps, lat 3.663 ms stddev 2.602, 0 failed
progress: 72.0 s, 261.0 tps, lat 3.832 ms stddev 1.889, 0 failed
progress: 73.0 s, 268.0 tps, lat 3.738 ms stddev 0.934, 0 failed

At 32 clients we go from ~10k to 2.5k, with a full 2s of 0.

Subtracting pg_current_wal_flush_lsn() from pg_current_wal_insert_lsn(), the
"good times" show a lag of ~8kB of unflushed WAL (note that this includes WAL
records that are still being inserted). Once the VACUUM runs, it's ~2-3MB.
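
That's just watching the insert/flush delta, e.g. with something like this
under psql's \watch while the benchmark runs:

    SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_insert_lsn(),
                                          pg_current_wal_flush_lsn()));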

The picture with more clients is similar.

If I instead severely limit the amount of outstanding (but not the amount of
unflushed) WAL by setting wal_buffers to 128, the throughput dip is quite a bit
smaller (down to ~400 tps instead of ~260 at 1 client, and from ~10k to ~5k
instead of ~2.5k at 32).  Of course that's ridiculous and will completely trash
performance in many other cases, but it shows that limiting the amount of
outstanding WAL could help without replication as well.  With remote storage,
the difference would likely be bigger.
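
(To be clear, that's the raw GUC value; without a unit wal_buffers is in 8kB
XLOG blocks, so if I have the units right that corresponds to roughly

    wal_buffers = 128        # ~1MB instead of the usual 16MB cap; needs a restart

in postgresql.conf.)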




> problems
> --------
> Now let's talk about some problems - both conceptual and technical
> (essentially review comments for the patch).
>
> 1) The goal of the patch is to limit the impact on latency, but the
> relationship between WAL amounts and latency may not be linear. But we
> don't have a good way to predict latency, and WAL lag is the only thing
> we have, so there's that. Ultimately, it's a best effort.

It's indeed probably not linear. Realistically, to do better, we probably need
statistics for the specific system in question - the latency impact will
differ hugely between different storage/network.


> 2) The throttling is per backend. That makes it simple, but it means
> that it's hard to enforce a global lag limit. Imagine the limit is 8MB,
> and with a single backend that works fine - the lag should not exceed
> the 8MB value. But if there are N backends, the lag could be up to
> N-times 8MB, I believe. That's a bit annoying, but I guess the only
> solution would be to have some autovacuum-like cost balancing, with all
> backends (or at least those running large stuff) doing the checks more
> often. I'm not sure we want to do that.

Hm. The average case is likely fine - the throttling of the different backends
will intersperse and flush more frequently - but the worst case is presumably
part of the issue here. I wonder if we could deal with this by somehow
offsetting the points at which backends flush.

I doubt we want to go for something like autovacuum's cost balancing - that
doesn't seem to work well - but I think we could take the amount of actually
unflushed WAL into account when deciding whether to throttle. We have the
necessary state in local memory IIRC. We'd have to be careful not to throttle
every backend at the same time, or we'll introduce latency penalties that way.
But what if we scaled synchronous_commit_wal_throttle_threshold depending on
the amount of unflushed WAL? By still taking backendWalInserted into account,
we'd avoid throttling everyone at the same time, but would still make
throttling more aggressive depending on the amount of unflushed/unreplicated
WAL.
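
Very roughly, and with everything except the existing
GetXLogInsertRecPtr()/GetFlushRecPtr() calls being made-up, illustrative names
(backendWalInserted and the GUC are the patch's, the scaling rule is just one
possibility), I'm thinking of something like:

static bool
WalThrottleNeeded(void)
{
	/* assume the GUC has already been converted to bytes here */
	uint64		threshold = synchronous_commit_wal_throttle_threshold;
	uint64		unflushed;

	/* globally unflushed WAL: insert position minus flush position */
	unflushed = GetXLogInsertRecPtr() - GetFlushRecPtr(NULL);

	/*
	 * Scale the per-backend threshold down as the global backlog grows,
	 * with a floor so that a short spike doesn't make every backend
	 * throttle at once.
	 */
	if (unflushed > threshold)
		threshold = Max(threshold / 4, threshold * threshold / unflushed);

	/* but still key the decision off this backend's own insertions */
	return backendWalInserted >= threshold;
}

I.e. everyone still throttles relative to their own insertions, the global
state just moves the trigger point earlier.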


> 3) The actual throttling (flush and wait for syncrep) happens in
> ProcessInterrupts(), which mostly works but it has two drawbacks:
>
>  * It may not happen "early enough" if the backend inserts a lot of
> XLOG records without processing interrupts in between.

Does such code exist? And if so, is there a reason not to fix said code?


>  * It may happen "too early" if the backend inserts enough WAL to need
> throttling (i.e. sets XLogDelayPending), but then after processing
> interrupts it would be busy with other stuff, not inserting more WAL.

> I think ideally we'd do the throttling right before inserting the next
> XLOG record, but there's no convenient place, I think. We'd need to
> annotate a lot of places, etc. So maybe ProcessInterrupts() is a
> reasonable approximation.

Yea, I think there's no way to do that with reasonable effort. Starting to
wait with a bunch of lwlocks held would obviously be bad.


> We may need to add CHECK_FOR_INTERRUPTS() to a couple more places, but
> that seems reasonable.

And independently beneficial.
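
E.g. (entirely hypothetical loop, just to show where such a check would sit):

	for (int i = 0; i < nitems; i++)
	{
		/* stands in for any code path that ends up in XLogInsert() */
		insert_one_item(&items[i]);

		/*
		 * With no lwlocks held here, ProcessInterrupts() gets a chance to
		 * act on XLogDelayPending close to where the WAL was produced,
		 * instead of at some much later, unrelated check.
		 */
		CHECK_FOR_INTERRUPTS();
	}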


> missing pieces
> --------------
> The thing that's missing is that some processes (like aggressive
> anti-wraparound autovacuum) should not be throttled. If people set the
> GUC in the postgresql.conf, I guess that'll affect those processes too,
> so I guess we should explicitly reset the GUC for those processes. I
> wonder if there are other cases that should not be throttled.

Hm, that's a bit hairy. If we just exempt it, we'll actually slow down everyone
else even further, even though the goal of the feature might be the opposite.
I don't think that's warranted for anti-wraparound vacuums - they're normal. I
think failsafe vacuums are a different story - there we really just don't care
about impacting other backends; the goal is to prevent the cluster from going
read-only in the near future.


> tangents
> --------
> While discussing this with Andres a while ago, he mentioned a somewhat
> orthogonal idea - sending unflushed data to the replica.
>
> We currently never send unflushed data to the replica, which makes sense
> because this data is not durable and if the primary crashes/restarts,
> this data will disappear. But it also means there may be a fairly large
> chunk of WAL data that we may need to send at COMMIT and wait for the
> confirmation.
>
> He suggested we might actually send the data to the replica, but the
> replica would know this data is not flushed yet and so would not do the
> recovery etc. And at commit we could just send a request to flush,
> without having to transfer the data at that moment.
>
> I don't have a very good intuition about how large the effect would be,
> i.e. how much unflushed WAL data could accumulate on the primary
> (kilobytes/megabytes?),

Obviously heavily depends on the workload. If you have anything with bulk
writes it can be many megabytes.


> and how big is the difference between sending a couple kilobytes or just a
> request to flush.

Obviously heavily depends on the network...


I used netperf's tcp_rr between my workstation and my laptop on a local 10Gbit
network (albeit with a crappy external card for my laptop) to put some numbers
to this. I used -r $s,100 to test sending variably sized data to the other
side, with the other side always responding with 100 bytes (assuming that'd
more than fit a feedback response).
Command:
fields="request_size,response_size,min_latency,mean_latency,max_latency,p99_latency,transaction_rate"; echo $fields;
fors in 10 100 1000 10000 100000 1000000;do netperf -P0 -t TCP_RR -l 3 -H alap5 -- -r $s,100 -o "$fields";done 

10gbe (latencies here and below are in microseconds):

request_size    response_size   min_latency     mean_latency    max_latency     p99_latency     transaction_rate
10              100             43              64.30           390             96              15526.084
100             100             57              75.12           428             122             13286.602
1000            100             47              74.41           270             108             13412.125
10000           100             89              114.63          712             152             8700.643
100000          100             167             255.90          584             312             3903.516
1000000         100             891             1015.99         2470            1143            983.708


Same hosts, but with my workstation forced to use a 1gbit connection:

request_size    response_size   min_latency     mean_latency    max_latency     p99_latency     transaction_rate
10              100             78              131.18          2425            257             7613.416
100             100             81              129.25          425             255             7727.473
1000            100             100             162.12          1444            266             6161.388
10000           100             310             686.19          1797            927             1456.204
100000          100             1006            1114.20         1472            1199            896.770
1000000         100             8338            8420.96         8827            8498            118.410

I haven't checked, but I'd assume that 100 bytes back and forth should easily
fit a new message to update LSNs and the existing feedback response. Even just
the difference between sending 100 bytes and sending 10kB (a bit more than a
single WAL page) is pretty significant on a 1gbit network.

Of course, the relatively low latency between these systems makes this more
pronounced than it would be on a cross-regional or even cross-continental link,
where the roundtrip latency is more likely to be dominated by distance rather
than throughput.

Testing between Europe and the western US:
request_size    response_size   min_latency     mean_latency    max_latency     p99_latency     transaction_rate
10              100             157934          167627.12       317705          160000          5.652
100             100             161294          171323.59       324017          170000          5.530
1000            100             161392          171521.82       324629          170000          5.524
10000           100             163651          173651.06       328488          170000          5.456
100000          100             166344          198070.20       638205          170000          4.781
1000000         100             225555          361166.12       1302368         240000          2.568


No meaningful difference before getting to 100kB. But it's pretty easy to lag
by 100kB on a longer-distance link...

Greetings,

Andres Freund


