RE: Logical replication prefetch

From: Zhijie Hou (Fujitsu)
Subject: RE: Logical replication prefetch
Date:
Msg-id: OS0PR01MB57164F6986446A4F18891F1E9454A@OS0PR01MB5716.jpnprd01.prod.outlook.com
In response to: Logical replication prefetch  (Konstantin Knizhnik <knizhnik@garret.ru>)
Responses: Re: Logical replication prefetch
List: pgsql-hackers
On Tuesday, July 8, 2025 2:36 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> 
> There is a well-known Postgres problem: a logical replication subscriber
> cannot catch up with the publisher, because LR changes are applied by a
> single worker while at the publisher the changes are made by multiple
> concurrent backends. The problem is not specific to logical replication:
> the physical replication stream is also handled by a single walreceiver. But
> for physical replication Postgres now implements
> prefetch: by looking at the blocks referenced by WAL records, it is quite
> easy to predict which pages will be required for redo and prefetch them.
> With logical replication the situation is much more complicated.
> 
> My first idea was to implement parallel apply of transactions. But to do
> that we need to track dependencies between transactions. Right now Postgres
> can apply transactions in parallel, but only if they are streamed (which is
> done only for large transactions), and it serializes them by commit order.
> It is possible to force parallel apply of short transactions using
> `debug_logical_replication_streaming`, but then performance is about 2x
> slower than sequential apply by a single worker. By removing serialization
> by commits, it is possible to speed up apply 3x and make the subscriber
> apply changes faster than the publisher can produce them, even with
> multiple clients. But that is possible only if the transactions are
> independent, which can be enforced only by tracking dependencies, and that
> seems very non-trivial and invasive.
> 
> I have not completely given up on the dependency-tracking approach, but
> decided first to try a simpler solution: prefetching. It is already used for
> physical replication. Certainly physical replication is a much simpler case,
> because each WAL record contains the list of accessed blocks.
> 
> In the case of logical replication, prefetching can be done either by
> prefetching access to the replica identity index (usually the primary key),
> or by executing the replication command in some background worker. The first
> approach is certainly much easier: we just perform the index lookup in a
> prefetch worker, which loads the accessed index and heap pages into shared
> buffers, so the main apply worker does not need to read anything from disk.
> But it works well only for DELETE and HOT UPDATE operations.
> 
> In the second case we execute the LR command normally in a background
> worker and then abort the transaction. Certainly, in this case we are doing
> the same work twice. But the assumption is the same: parallel prefetch
> workers should load the affected pages, speeding up the work of the main
> apply worker.
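> 
> To make the two modes concrete, here is a rough pseudocode sketch
> (illustrative only, not the actual patch code):
> 
> ```
> -- Mode 1: replica-identity prefetch (cheap; helps DELETE and HOT UPDATE)
> for each UPDATE/DELETE in the change stream:
>     look up the replica identity key in the index
>     -- the lookup pulls the index and heap pages into shared buffers;
>     -- the lookup result itself is discarded
> 
> -- Mode 2: full prefetch (execute and abort)
> for each change in the stream:
>     begin transaction
>     apply the change as usual   -- pulls all affected pages, incl. secondary indexes
>     abort transaction           -- the main apply worker will redo it for real
> ```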
> 
> I have implemented a PoC (see attached patch) and got first results on the
> efficiency of such prefetching.
> 
> *** First scenario (update-only).
> 
> Publisher:
> ```
> create table t(pk integer primary key, counter integer, filler text default repeat('x', 1000)) with (fillfactor=10);
> insert into t values (generate_series(1,100000), 0);
> create publication pub1 for table t;
> ```
> 
> Subscriber:
> ```
> create table t(pk integer primary key, counter integer, filler text default repeat('x', 1000)) with (fillfactor=10);
> create subscription sub1 connection 'port=54321 dbname=postgres'
> publication pub1;
> ```
> 
> Then I wait until replication is synced, stop the subscriber, and do random
> updates in 10 sessions at the publisher:
> 
> ```
> pgbench -T 100 -c 10 -M prepared -n -f update.sql -p 54321 -d postgres
> ```
> 
> where update.sql is:
> 
> ```
> \set pk random(1, 100000)
> update t set counter=counter+1 where pk=:pk;
> ```
> 
> Then I start the subscriber and measure how much time it needs to catch up.
> Results:
> 
> no prefetch: 2:00 min
> prefetch (replica identity only): 0:55 min
> prefetch (all): 1:10 min
> 
> This is definitely the best case for replica-identity-only prefetch
> (update-only and no other indexes).
> How should these results be interpreted?
> 
> Without prefetch, applying the updates takes about twice as long at the
> subscriber as performing them at the publisher.
> It means that under a heavy workload the subscriber has no chance to catch up.
> 
> With replica identity index prefetching, apply time is even smaller than the
> time needed to perform the updates at the publisher.
> Executing the whole operation and aborting the transaction certainly adds
> more overhead, but the improvement is still quite significant.
> 
> Please also note that these results were obtained on a system with a large
> amount of RAM (64GB) and a fast SSD.
> With a data set not fitting in RAM and much slower disks, the difference is
> expected to be more significant.
> I tried to simulate this by adding a 0.1 msec delay to pg_preadv.
> With this artificial 0.1 msec `preadv` delay, I got the following results:
> 
> no prefetch: 7:40
> prefetch (replica identity only): 3:10 min
> prefetch (all): 3:09 min
> 
> 
> In this case apply takes much more time than the 100 seconds during which
> the updates are performed at the publisher. Prefetch can improve speed about
> two times, but it doesn't allow the subscriber to catch up.
> 
> 
> 
> *** Second scenario: inserts with secondary random key.
> 
> 
> Publisher:
> 
> ```
> create table t(pk serial primary key, sk integer, counter integer default 0);
> insert into t (sk) select random()*10000000 from generate_series(1,10000000);
> create index on t(sk);
> create publication pub1 for table t;
> ```
> 
> Subscriber:
> ```
> 
> create table t(pk integer primary key, sk integer, counter integer);
> create index on t(sk);
> create subscription sub1 connection 'port=54321 dbname=postgres'
> publication pub1;
> ```
> 
> workload:
> 
> ```
> pgbench -T 100 -c 10 -M prepared -n -f insert.sql -p 54321 -d postgres
> 
> ```
> 
> where insert.sql:
> 
> ```
> INSERT INTO t (sk) VALUES (random()*10000000);
> ```
> 
> Results (with the 0.1 msec delay) are the following:
> 
> no prefetch: 10:10 min
> prefetch (identity): 8:25 min
> prefetch (full): 5:50 min
> 
> Here, as expected, prefetching only the primary key doesn't provide a big
> improvement. But replaying the insert command in a prefetch worker allows
> apply to be sped up almost twice.
> 
> Please note that this approach requires minimal changes in Postgres, because
> all of the parallel apply worker infrastructure is already present, and we
> can reuse the same apply code (with minimal changes) for performing
> prefetch. I only had to introduce extra tuple lock types (no-lock and
> try-lock) to minimize overhead and lock conflicts between prefetch and main
> apply workers. Still, this cannot completely prevent lock conflicts and
> deadlocks in prefetch workers; more work seems to be needed here. Also, I
> set `wal_level=minimal` in prefetch workers to avoid WAL-logging overhead.
> 
> The number of prefetch workers is specified by the
> `max_parallel_prefetch_workers_per_subscription` GUC. If it is zero
> (the default), no prefetching is performed.
> The prefetch mode is controlled by the `prefetch_replica_identity_only` GUC.
> By default it is true, which makes prefetch efficient for HOT updates,
> deletes, or inserts into a table with just one index (the primary key).
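> 
> For example, a subscriber configuration fragment enabling full-command
> prefetch with the GUCs above (the value 4 is just an illustration):
> 
> ```
> max_parallel_prefetch_workers_per_subscription = 4  # 0 (default) disables prefetch
> prefetch_replica_identity_only = off                # replay whole commands in prefetch workers
> ```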
> 
> 
> Attached please find the patch and the two shell scripts used to produce
> these test results.
> It may also be more convenient to inspect this patch as a PR:
> https://github.com/knizhnik/postgres/pull/3
> 
> I wonder whether such an LR prefetching approach is considered useful?
> Or is it better to investigate other ways to improve LR apply speed
> (parallel apply)?

Thank you for the proposal! I find it to be a very interesting feature.

I tested the patch you shared in your original email and encountered potential
deadlocks when testing a pgbench TPC-B-like workload. Could you please provide
an updated patch version so that I can conduct further performance experiments?

Additionally, I was also exploring ways to improve performance and have tried
an alternative version of prefetch for experimentation. The alternative design
is that we assign each non-streaming transaction to a parallel apply worker,
while strictly maintaining the order of commits. During parallel apply, if the
transactions that need to be committed before the current transaction have not
yet finished, the worker performs prefetch operations. Specifically, for
updates and deletes, the worker finds and caches the target local tuple to be
updated/deleted. Once all preceding transactions are committed, the parallel
apply worker uses these cached tuples to execute the actual updates or deletes.
What do you think about this alternative? I think it might offer more
stability in scenarios where shared buffer eviction occurs frequently, and it
avoids leaving dead tuples in the buffer. However, it also presents some
drawbacks, such as the need to add wait events to maintain commit order,
compared to the approach discussed in this thread.
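
The flow described above can be sketched roughly as follows (pseudocode; the
steps are illustrative, not taken from the POC patch):

```
-- per parallel apply worker, for one assigned non-streaming transaction
for each change in the transaction:
    if change is UPDATE or DELETE:
        locate the target local tuple via the replica identity index
        cache (relation, tid, old tuple) for later use   -- the "prefetch" step
wait until all preceding transactions have committed     -- preserves commit order
for each cached target:
    re-check that the cached tuple is still the current version,
    then perform the actual update/delete
commit
```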

(Note that, due to time constraints, I have not implemented prefetch for
inserts, and the code is not in reviewable shape and lacks comments and
documentation; I am just sharing the POC patch for reference.)

Best Regards,
Hou zj

Attachments
