Discussion: Reduce useless changes before reassembly during logical replication
Hi hackers,
During logical replication, if there is a large write transaction, some
spill files will be written to disk, depending on the setting of
logical_decoding_work_mem.
This behavior can effectively avoid OOM, but if the transaction
generates a lot of changes before commit, a large number of spill files may
fill the disk; updating a TB-scale table is one example.
However, I noticed an inelegant behavior: even if the modified large table is not
published, its changes are still written out as a large number of spill files.
Look at the example below:
publisher:
```
create table tbl_pub(id int, val1 text, val2 text,val3 text);
create table tbl_t1(id int, val1 text, val2 text,val3 text);
CREATE PUBLICATION mypub FOR TABLE public.tbl_pub;
```
subscriber:
```
create table tbl_pub(id int, val1 text, val2 text,val3 text);
create table tbl_t1(id int, val1 text, val2 text,val3 text);
CREATE SUBSCRIPTION mysub CONNECTION 'host=127.0.0.1 port=5432
user=postgres dbname=postgres' PUBLICATION mypub;
```
publisher:
```
begin;
insert into tbl_t1 select i,repeat('xyzzy', i),repeat('abcba',
i),repeat('dfds', i) from generate_series(0,999999) i;
```
Later you will see a large number of spill files in the
"/$PGDATA/pg_replslot/mysub/" directory.
```
$ll -sh
total 4.5G
4.0K -rw------- 1 postgres postgres 200 Nov 30 09:24 state
17M -rw------- 1 postgres postgres 17M Nov 30 08:22 xid-750-lsn-0-10000000.spill
12M -rw------- 1 postgres postgres 12M Nov 30 08:20 xid-750-lsn-0-1000000.spill
17M -rw------- 1 postgres postgres 17M Nov 30 08:23 xid-750-lsn-0-11000000.spill
......
```
We can see that table tbl_t1 is not in publication mypub, and its changes won't
be sent downstream because it isn't subscribed. Only after the transaction is
reassembled does the pgoutput decoding plugin filter out changes to these
unpublished relations, when sending the logical changes.
See function pgoutput_change.
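For context, the relevant check in pgoutput_change looks roughly like the sketch below (simplified from memory, not the exact upstream code); the point is that it runs only after the whole transaction has already been decoded, spilled and reassembled:
```c
static void
pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				Relation relation, ReorderBufferChange *change)
{
	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
	RelationSyncEntry *relentry;

	if (!is_publishable_relation(relation))
		return;

	relentry = get_rel_sync_entry(data, relation);

	/* Drop the change if the publication doesn't cover this action. */
	switch (change->action)
	{
		case REORDER_BUFFER_CHANGE_INSERT:
			if (!relentry->pubactions.pubinsert)
				return;
			break;
		case REORDER_BUFFER_CHANGE_UPDATE:
			if (!relentry->pubactions.pubupdate)
				return;
			break;
		case REORDER_BUFFER_CHANGE_DELETE:
			if (!relentry->pubactions.pubdelete)
				return;
			break;
		default:
			break;
	}

	/* ... otherwise the change is serialized and sent downstream ... */
}
```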
Most importantly, if we filter out changes to unpublished relations after
constructing each change but before queuing it into the transaction, wouldn't
that reduce the workload of logical decoding and avoid the disk or memory
growth as much as possible?
The patch in the attachment is a prototype, which can effectively reduce the
memory and disk space usage during logical replication.
Design:
1. Added a callback LogicalDecodeFilterByRelCB for the output plugin.
2. Implemented this callback as pgoutput_table_filter in the pgoutput plugin.
Its implementation is based on the table filtering logic in the
pgoutput_change function: it determines, from the publication parameters,
whether a change needs to be published, and filters it out if not.
3. After constructing a change and before queuing it into the transaction,
use RelidByRelfilenumber to obtain the relation associated with the change,
just like the relation lookup done in the ReorderBufferProcessTXN function.
4. The relation may be a TOAST relation, and there is no good way to get the
main table's relation from the TOAST relation directly. Here, I get the main
table's OID through the TOAST relation's name, and then open the main table's
relation.
5. This filtering covers INSERT/UPDATE/DELETE. Other change types have not
been considered yet and can be added in the future. (A rough sketch of the
callback and the filtering step follows this list.)
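To make the design concrete, here is a rough sketch of the new callback and the filtering step. The names follow the prototype (LogicalDecodeFilterByRelCB, FilterByTable, and a new filter_by_table_cb member in the output-plugin callbacks), but the bodies below are illustrative only, not the attached patch:
```c
/* Output-plugin callback: return true if changes to this relation
 * should be dropped before being queued into the reorder buffer. */
typedef bool (*LogicalDecodeFilterByRelCB) (struct LogicalDecodingContext *ctx,
											Relation relation,
											ReorderBufferChangeType action);

/*
 * Called after a change has been constructed but before
 * ReorderBufferQueueChange().
 */
static bool
FilterByTable(LogicalDecodingContext *ctx, ReorderBufferChange *change)
{
	Oid			relid;
	Relation	relation;
	bool		filter;

	/*
	 * Map the relfilenumber recorded in WAL back to a relation OID, the
	 * same lookup ReorderBufferProcessTXN does later.  This needs catalog
	 * access under a historic snapshot, which is where the overhead
	 * discussed later in this thread comes from.
	 */
	relid = RelidByRelfilenumber(change->data.tp.rlocator.spcOid,
								 change->data.tp.rlocator.relNumber);
	if (!OidIsValid(relid))
		return false;			/* can't resolve it: keep the change */

	relation = RelationIdGetRelation(relid);
	if (!RelationIsValid(relation))
		return false;

	/*
	 * For a TOAST relation, the prototype first resolves the main table
	 * (via the TOAST relation's name) and filters based on that instead.
	 */

	/* Let the output plugin (pgoutput_table_filter) decide. */
	filter = ctx->callbacks.filter_by_table_cb(ctx, relation, change->action);

	RelationClose(relation);
	return filter;
}
```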
Test:
1. Added a test case 034_table_filter.pl
2. As in the example above, create two tables: the published table tbl_pub and
the unpublished table tbl_t1.
3. Insert 10,000 rows of TOAST data into tbl_t1 on the publisher, and use
pg_ls_replslotdir to record the total size of the slot directory every second.
4. Compare the size of the slot directory at the beginning of the transaction
(size1), the size at the end of the transaction (size2), and the average size
over the entire process (size3).
5. Assert(size1 == size2 == size3)
I sincerely look forward to your feedback.
Regards, lijie
On Wed, Jan 17, 2024 at 11:45 AM li jie <ggysxcq@gmail.com> wrote:
>
> Hi hackers,
>
> During logical replication, if there is a large write transaction, some
> spill files will be written to disk, depending on the setting of
> logical_decoding_work_mem.
>
> This behavior can effectively avoid OOM, but if the transaction
> generates a lot of change before commit, a large number of files may
> fill the disk. For example, you can update a TB-level table.
>
> However, I found an inelegant phenomenon. If the modified large table is not
> published, its changes will also be written with a large number of spill files.
> Look at an example below:
Thanks. I agree that decoding and queuing the changes of unpublished
tables' data into reorder buffer is an unnecessary task for walsender.
It takes processing efforts (CPU overhead), consumes disk space and
uses memory configured via logical_decoding_work_mem for a replication
connection inefficiently.
> Later you will see a large number of spill files in the
>
> We can see that table tbl_t1 is not published in mypub. It also won't be sent
> downstream because it's not subscribed.
> After the transaction is reorganized, the pgoutput decoding plugin filters out
> changes to these unpublished relationships when sending logical changes.
> See function pgoutput_change.
Right. Here's my testing [1].
> Most importantly, if we filter out unpublished relationship-related
> changes after constructing the changes but before queuing the changes
> into a transaction, will it reduce the workload of logical decoding
> and avoid disk
> or memory growth as much as possible?
Right. It can.
> The patch in the attachment is a prototype, which can effectively reduce the
> memory and disk space usage during logical replication.
>
> Design:
> 1. Added a callback LogicalDecodeFilterByRelCB for the output plugin.
>
> 2. Added this callback function pgoutput_table_filter for the pgoutput plugin.
> Its main implementation is based on the table filter in the
> pgoutput_change function.
> Its main function is to determine whether the change needs to be published based
> on the parameters of the publication, and if not, filter it.
>
> 3. After constructing a change and before Queue a change into a transaction,
> use RelidByRelfilenumber to obtain the relation associated with the change,
> just like obtaining the relation in the ReorderBufferProcessTXN function.
>
> 4. Relation may be a toast, and there is no good way to get its real
> table relation based on toast relation. Here, I get the real table oid
> through toast relname, and then get the real table relation.
>
> 5. This filtering takes into account INSERT/UPDATE/INSERT. Other
> changes have not been considered yet and can be expanded in the future.
The design of this patch is based on the principle of logical decoding
filtering things out as early as possible, and looks very similar to
filter_prepare_cb_wrapper/pg_decode_filter_prepare and
filter_by_origin_cb/pgoutput_origin_filter. Per my understanding this
design looks okay unless I'm missing something.
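For reference, those existing early-filter hooks are declared in src/include/replication/output_plugin.h roughly as follows (signatures paraphrased from memory; the header is authoritative):
```c
typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ctx,
											   RepOriginId origin_id);

typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
											   TransactionId xid,
											   const char *gid);
```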
> Test:
> 1. Added a test case 034_table_filter.pl
> 2. Like the case above, create two tables, the published table tbl_pub and
> the non-published table tbl_t1
> 3. Insert 10,000 rows of toast data into tbl_t1 on the publisher, and use
> pg_ls_replslotdir to record the total size of the slot directory
> every second.
> 4. Compare the size of the slot directory at the beginning of the
> transaction(size1),
> the size at the end of the transaction (size2), and the average
> size of the entire process(size3).
> 5. Assert(size1==size2==size3)
I bet that the above test with 10K rows is going to take a noticeable
time on some buildfarm members (it took 6 seconds on my dev system
which is an AWS EC2 instance). And, the above test can get flaky.
Therefore, IMO, the concrete way of testing this feature is by looking
at the server logs for the following message using
PostgreSQL::Test::Cluster log_contains().
+filter_done:
+
+	if (result && RelationIsValid(relation))
+		elog(DEBUG1, "logical filter change by table %s",
+			 RelationGetRelationName(relation));
+
Here are some comments on the v1 patch:
1.
@@ -1415,9 +1419,6 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	TupleTableSlot *old_slot = NULL;
 	TupleTableSlot *new_slot = NULL;

-	if (!is_publishable_relation(relation))
-		return;
-
Instead of removing is_publishable_relation from pgoutput_change, I
think it can just be turned into an assertion
Assert(is_publishable_relation(relation));, no?
2.
+	switch (change->action)
+	{
+		/* intentionally fall through */

Perhaps, it must use /* FALLTHROUGH */ just like elsewhere in the
code, otherwise a warning is thrown.
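For what it's worth, the idiom used elsewhere in the tree puts the marker immediately before the case label that control falls into, e.g. (an illustrative fragment, not the patch code):
```c
static bool
change_action_is_filterable(ReorderBufferChangeType action)
{
	bool		filterable;

	switch (action)
	{
		case REORDER_BUFFER_CHANGE_INSERT:
			/* per-action handling could go here ... */
			/* FALLTHROUGH */
		case REORDER_BUFFER_CHANGE_UPDATE:
		case REORDER_BUFFER_CHANGE_DELETE:
			filterable = true;	/* the three DML cases share one path */
			break;
		default:
			filterable = false; /* other change types are not filtered yet */
			break;
	}

	return filterable;
}
```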
3. From the commit message:

Most of the code in the FilterByTable function is transplanted from the
ReorderBufferProcessTXN function, which can be called before the
ReorderBufferQueueChange function. It is

I think the above note can just be placed above the FilterByTable function
for better understanding.

+static bool
+FilterByTable(LogicalDecodingContext *ctx, ReorderBufferChange *change)
+{
4. Why is FilterByTable(ctx, change) call placed after DecodeXLogTuple
in DecodeInsert, DecodeUpdate and DecodeDelete? Is there a use for
decoded tuples done by DecodeXLogTuple in the new callback
filter_by_table_cb? If not, can we move FilterByTable call before
DecodeXLogTuple to avoid some more extra processing?
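In other words, the suggestion amounts to something like the shape below; this is a heavily simplified sketch (the real DecodeInsert() in decode.c does much more, and FilterByTable is the patch's function, not existing code), shown only to illustrate the ordering question:
```c
static void
DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
	ReorderBufferChange *change;

	change = ReorderBufferGetChange(ctx->reorder);
	change->action = REORDER_BUFFER_CHANGE_INSERT;
	/* ... fill change->data.tp.rlocator from the WAL record ... */

	/*
	 * If the filter doesn't need the decoded tuple, calling it here (before
	 * DecodeXLogTuple) avoids decoding tuples that will be thrown away.
	 */
	if (FilterByTable(ctx, change))
	{
		ReorderBufferReturnChange(ctx->reorder, change, true);
		return;
	}

	/* ... DecodeXLogTuple(...) and ReorderBufferQueueChange(...) ... */
}
```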
5. Why is ReorderBufferChange needed as a parameter to FilterByTable
and filter_by_table_cb? Can't just the LogicalDecodingContext, the relation
name, and the change action be enough to decide whether the table is
publishable or not? If done this way, it can avoid some more
processing, no?
6. Please run pgindent and pgperltidy on the new source code and new
TAP test file respectively.
[1]
HEAD:
postgres=# BEGIN;
BEGIN
Time: 0.110 ms
postgres=*# insert into tbl_t1 select i,repeat('xyzzy',
i),repeat('abcba', i),repeat('dfds', i) from generate_series(0,99999)
i;
INSERT 0 100000
Time: 379488.265 ms (06:19.488)
postgres=*#
ubuntu:~/postgres/pg17/bin$ du -sh
/home/ubuntu/postgres/pg17/bin/db17/pg_replslot/mysub
837M /home/ubuntu/postgres/pg17/bin/db17/pg_replslot/mysub
ubuntu:~/postgres/pg17/bin$ du -sh /home/ubuntu/postgres/pg17/bin/db17
2.6G /home/ubuntu/postgres/pg17/bin/db17
PATCHED:
postgres=# BEGIN;
BEGIN
Time: 0.105 ms
postgres=*# insert into tbl_t1 select i,repeat('xyzzy',
i),repeat('abcba', i),repeat('dfds', i) from generate_series(0,99999)
i;
INSERT 0 100000
Time: 380044.554 ms (06:20.045)
ubuntu:~/postgres$ du -sh /home/ubuntu/postgres/pg17/bin/db17/pg_replslot/mysub
8.0K /home/ubuntu/postgres/pg17/bin/db17/pg_replslot/mysub
ubuntu:~/postgres$ du -sh /home/ubuntu/postgres/pg17/bin/db17
1.8G /home/ubuntu/postgres/pg17/bin/db17
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Jan 18, 2024 at 12:12 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Wed, Jan 17, 2024 at 11:45 AM li jie <ggysxcq@gmail.com> wrote:
> >
> > Hi hackers,
> >
> > During logical replication, if there is a large write transaction, some
> > spill files will be written to disk, depending on the setting of
> > logical_decoding_work_mem.
> >
> > This behavior can effectively avoid OOM, but if the transaction
> > generates a lot of change before commit, a large number of files may
> > fill the disk. For example, you can update a TB-level table.
> >
> > However, I found an inelegant phenomenon. If the modified large table is not
> > published, its changes will also be written with a large number of spill files.
> > Look at an example below:
>
> Thanks. I agree that decoding and queuing the changes of unpublished
> tables' data into reorder buffer is an unnecessary task for walsender.
> It takes processing efforts (CPU overhead), consumes disk space and
> uses memory configured via logical_decoding_work_mem for a replication
> connection inefficiently.
>

This is all true but note that in successful cases (where the table is
published) all the work done by FilterByTable(accessing caches,
transaction-related stuff) can add noticeable overhead as anyway we do
that later in pgoutput_change(). I think I gave the same comment
earlier as well but didn't see any satisfactory answer or performance
data for successful cases to back this proposal. Note, users can
configure to stream_in_progress transactions in which case they
shouldn't see such a big problem. However, I agree that if we can find
some solution where there is no noticeable overhead then that would be
worth considering.

--
With Regards,
Amit Kapila.
On Thu, Jan 18, 2024 at 2:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 18, 2024 at 12:12 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > On Wed, Jan 17, 2024 at 11:45 AM li jie <ggysxcq@gmail.com> wrote:
> > >
> > > Hi hackers,
> > >
> > > During logical replication, if there is a large write transaction, some
> > > spill files will be written to disk, depending on the setting of
> > > logical_decoding_work_mem.
> > >
> > > This behavior can effectively avoid OOM, but if the transaction
> > > generates a lot of change before commit, a large number of files may
> > > fill the disk. For example, you can update a TB-level table.
> > >
> > > However, I found an inelegant phenomenon. If the modified large table is not
> > > published, its changes will also be written with a large number of spill files.
> > > Look at an example below:
> >
> > Thanks. I agree that decoding and queuing the changes of unpublished
> > tables' data into reorder buffer is an unnecessary task for walsender.
> > It takes processing efforts (CPU overhead), consumes disk space and
> > uses memory configured via logical_decoding_work_mem for a replication
> > connection inefficiently.
>
> This is all true but note that in successful cases (where the table is
> published) all the work done by FilterByTable(accessing caches,
> transaction-related stuff) can add noticeable overhead as anyway we do
> that later in pgoutput_change().

Right. Overhead for published tables needs to be studied. A possible way is
to mark the checks performed in FilterByTable/filter_by_table_cb and skip the
same checks in pgoutput_change. I'm not sure if this works without any issues
though.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
>
> This is all true but note that in successful cases (where the table is
> published) all the work done by FilterByTable(accessing caches,
> transaction-related stuff) can add noticeable overhead as anyway we do
> that later in pgoutput_change(). I think I gave the same comment
> earlier as well but didn't see any satisfactory answer or performance
> data for successful cases to back this proposal.
I did some benchmarking yesterday at [1] and found it adds ~20% CPU time.
Then a basic idea came out of that, which I think deserves a share. The
"transaction-related stuff" comes from the syscache/systable access (apart
from the historic snapshot), and the syscache is required in the following
situations:

1. relfilenode (from WAL) -> relid.
2. relid -> namespaceid (to check if the relid is a TOAST relation).
3. if TOAST, get its original relid.
4. access the data from pg_publication_tables.
5. see if the relid is a partition; if yes, we may get its root relation.

Actually we already have a RelationSyncCache for #4, and it *only* needs to
access the syscache when replicate_valid is false. I think that case should
be rare, but the caller doesn't know that, so the caller must prepare the
transaction stuff in advance even though in most cases it is not used. So I
think we can get an optimization here.
Hence the attached patch was made.
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date: Wed Feb 21 18:40:03 2024 +0800

Make get_rel_sync_entry less dependent on transaction state.

get_rel_sync_entry needs a transaction only when a replicate_valid = false
entry is found, which should be a rare case. However the callers can't know
whether an entry is valid, so they have to prepare the transaction state
before calling this function. Such preparation is expensive.

This patch makes get_rel_sync_entry set up the transaction state only if
necessary, so the callers don't need to prepare it blindly.
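A compressed illustration of that idea is below. The names mirror pgoutput.c (RelationSyncCache, replicate_valid), but this is just a sketch of the lazy transaction handling, not the attached patch:
```c
static RelationSyncEntry *
get_rel_sync_entry_lazy(PGOutputData *data, Oid relid)
{
	RelationSyncEntry *entry;
	bool		found;
	bool		started_xact = false;

	/* Cheap path: a plain hash lookup, no catalog access required. */
	entry = (RelationSyncEntry *) hash_search(RelationSyncCache,
											  &relid, HASH_ENTER, &found);

	if (!found || !entry->replicate_valid)
	{
		/*
		 * Only the (hopefully rare) cache-miss / invalidated path needs
		 * catalog lookups, so only here do we make sure a transaction
		 * exists, instead of forcing every caller to set one up.
		 */
		if (!IsTransactionOrTransactionBlock())
		{
			StartTransactionCommand();
			started_xact = true;
		}

		/* ... rebuild the entry from the publication catalogs ... */

		if (started_xact)
			AbortCurrentTransaction();
	}

	return entry;
}
```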
Then, coming to #1, we actually have RelfilenumberMapHash as a cache; when
the cache is hit (suppose this is the usual case), no transaction stuff is
involved. I have two ideas then:

1. Optimize the cache-hit situation, like what we just did for
get_rel_sync_entry, for all 5 kinds of data, and only pay the effort in the
cache-miss case. For the data for #2, #3 and #5, all the keys are relids, so
I think a single HTAB should be OK.
2. Add the content for #1, #2, #3 and #5 to WAL when wal_level is set to
logical.

In either case, the changes to get_rel_sync_entry should be needed.
> Note, users can
> configure to stream_in_progress transactions in which case they
> shouldn't see such a big problem.
People wouldn't see the changes spilled to disk, but the CPU cost for the
reorder buffer should still be paid, I think.
[1] https://www.postgresql.org/message-id/87o7cadqj3.fsf%40163.com
--
Best Regards
Andy Fan
Hi,

Sorry I replied too late.

> This is all true but note that in successful cases (where the table is
> published) all the work done by FilterByTable(accessing caches,
> transaction-related stuff) can add noticeable overhead as anyway we do
> that later in pgoutput_change().

You are correct. Frequently opening transactions and accessing the caches
causes a lot of overhead, which Andy has tested and proved. The root cause is
that every DML WAL record needs to do this, which is really wasteful. I use a
hash table LocatorFilterCache to solve this problem: after getting a
RelFileLocator, I look it up in the hash table to get its PublicationActions
and filter based on that to determine whether the change is published.

The effect in my test is very obvious (perf record):

v1:
  Children      Self  Command   Shared O  Symbol
+   22.04%     1.53%  postgres  postgres  [.] FilterByTable

v2:
  Children      Self  Command   Shared O  Symbol
+    0.58%     0.00%  postgres  postgres  [.] ReorderBufferFilterByLocator

The v1 patch introduces 20% overhead, while v2 only has 0.58%.

> Note, users can
> configure to stream_in_progress transactions in which case they
> shouldn't see such a big problem.

Yes, stream mode can prevent these irrelevant changes from being written to
disk or sent downstream. However, CPU and memory consumption is still
incurred when processing these useless changes. Here is my simple test [1]:

Based on master:

CPU stat: perf stat -p pid -e cycles -I 1000

#           time             counts unit events
     76.007070936          9,691,035      cycles
     77.007163484          5,977,694      cycles
     78.007252533          5,924,703      cycles
     79.007346862          5,861,934      cycles
     80.007438070          5,858,264      cycles
     81.007527122          6,408,759      cycles
     82.007615711          6,397,988      cycles
     83.007705685          5,520,407      cycles
     84.007794387          5,359,162      cycles
     85.007884879          5,194,079      cycles
     86.007979797          5,391,270      cycles
     87.008069606          5,474,536      cycles
     88.008162827          5,594,190      cycles
     89.008256327          5,610,023      cycles
     90.008349583          5,627,350      cycles
     91.008437785          6,273,510      cycles
     92.008527938        580,934,205      cycles
     93.008620136          4,404,672      cycles
     94.008711818          4,599,074      cycles
     95.008805591          4,374,958      cycles
     96.008894543          4,300,180      cycles
     97.008987582          4,157,892      cycles
     98.009077445          4,072,178      cycles
     99.009163475          4,043,875      cycles
    100.009254888          5,382,667      cycles

memory stat: pidstat -p pid -r 1 10

07:57:18 AM   UID       PID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
07:57:19 AM  1000     11848    233.00      0.00  386872   81276   0.01  postgres
07:57:20 AM  1000     11848    235.00      0.00  387008   82068   0.01  postgres
07:57:21 AM  1000     11848    236.00      0.00  387144   83124   0.01  postgres
07:57:22 AM  1000     11848    236.00      0.00  387144   83916   0.01  postgres
07:57:23 AM  1000     11848    236.00      0.00  387280   84972   0.01  postgres
07:57:24 AM  1000     11848    334.00      0.00  337000   36928   0.00  postgres
07:57:25 AM  1000     11848      3.00      0.00  337000   36928   0.00  postgres
07:57:26 AM  1000     11848      0.00      0.00  337000   36928   0.00  postgres
07:57:27 AM  1000     11848      0.00      0.00  337000   36928   0.00  postgres
07:57:28 AM  1000     11848      0.00      0.00  337000   36928   0.00  postgres
Average:     1000     11848    151.30      0.00  362045   60000   0.01  postgres

After patched:

#           time             counts unit events
     76.007623310          4,237,505      cycles
     77.007717436          3,989,618      cycles
     78.007813848          3,965,857      cycles
     79.007906412          3,601,715      cycles
     80.007998111          3,670,835      cycles
     81.008092670          3,495,844      cycles
     82.008187456          3,822,695      cycles
     83.008281335          5,034,146      cycles
     84.008374998          3,867,683      cycles
     85.008470245          3,996,927      cycles
     86.008563783          3,823,893      cycles
     87.008658628          3,825,472      cycles
     88.008755246          3,823,079      cycles
     89.008849719          3,966,083      cycles
     90.008945774          4,012,704      cycles
     91.009044492          4,026,860      cycles
     92.009139621          3,860,912      cycles
     93.009242485          3,961,533      cycles
     94.009346304          3,799,897      cycles
     95.009440164          3,959,602      cycles
     96.009534251          3,960,405      cycles
     97.009625904          3,762,581      cycles
     98.009716518          4,859,490      cycles
     99.009807720          3,940,845      cycles
    100.009901399          3,888,095      cycles

08:01:47 AM   UID       PID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
08:01:48 AM  1000     19466      0.00      0.00  324424   15140   0.00  postgres
08:01:49 AM  1000     19466      0.00      0.00  324424   15140   0.00  postgres
08:01:50 AM  1000     19466      0.00      0.00  324424   15140   0.00  postgres
08:01:51 AM  1000     19466      0.00      0.00  324424   15140   0.00  postgres
08:01:52 AM  1000     19466      0.00      0.00  324424   15140   0.00  postgres
08:01:53 AM  1000     19466      0.00      0.00  324424   15140   0.00  postgres
08:01:54 AM  1000     19466      0.00      0.00  324424   15140   0.00  postgres
08:01:55 AM  1000     19466      0.00      0.00  324424   15140   0.00  postgres
08:01:56 AM  1000     19466      0.00      0.00  324424   15140   0.00  postgres
08:01:57 AM  1000     19466      0.00      0.00  324424   15140   0.00  postgres
Average:     1000     19466      0.00      0.00  324424   15140   0.00  postgres

The comparison shows that the patch is also profitable for stream mode.

Of course, LocatorFilterCache also needs to deal with invalidation, such as
the corresponding relation being invalidated or pg_publication changing, just
like RelationSyncCache and RelfilenumberMapHash. But DDL is a small amount
after all, which is insignificant compared to a large amount of DML.

Another problem is that LocatorFilterCache looks redundant compared to
RelationSyncCache and RelfilenumberMapHash, like this:
1. RelfilenumberMapHash: relfilenode -> relation oid
2. RelationSyncCache: relation oid -> PublicationActions
3. LocatorFilterCache: RelFileLocator -> PublicationActions

The reason is that you cannot simply chain the two existing caches to get
from the relfilenode to the PublicationActions: you must use historical
snapshots to access transactions and the relcache in between. So there is no
good solution for this for the time being; ugly but effective.

> Therefore, IMO, the concrete way of testing this feature is by looking
> at the server logs for the following message using
> PostgreSQL::Test::Cluster log_contains().

Thanks, done.

> Instead of removing is_publishable_relation from pgoutput_change, I
> think it can just be turned into an assertion
> Assert(is_publishable_relation(relation));, no?

Yes, done.

> Perhaps, it must use /* FALLTHROUGH */ just like elsewhere in the
> code, otherwise a warning is thrown.

/* intentionally fall through */ can also avoid warnings.

> Can't just the LogicalDecodingContext, the relation name, and the change
> action be enough to decide whether the table is publishable or not? If done
> this way, it can avoid some more processing, no?

Yes, RelFileLocator filtering is used directly in v2, and the change is no
longer required.

> Please run pgindent and pgperltidy on the new source code and new
> TAP test file respectively.

OK.

[1]: https://www.postgresql.org/message-id/CAGfChW62f5NTNbLsqO-6_CrmKPqBEQtWPcPDafu8pCwZznk%3Dxw%40mail.gmail.com
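For readers who want to picture the v2 LocatorFilterCache described above, here is a sketch inferred from that description; the key/value layout and function names are guesses, and the real code is in the attached patch:
```c
#include "postgres.h"
#include "catalog/pg_publication.h"		/* PublicationActions */
#include "replication/reorderbuffer.h"	/* ReorderBufferChangeType */
#include "storage/relfilelocator.h"		/* RelFileLocator */
#include "utils/hsearch.h"				/* HTAB, hash_create */

typedef struct LocatorFilterEntry
{
	RelFileLocator key;				/* hash key: spcOid / dbOid / relNumber */
	bool		valid;				/* reset by relcache/publication invalidations */
	PublicationActions pubactions;	/* what, if anything, is published */
} LocatorFilterEntry;

static HTAB *LocatorFilterCache = NULL;

static void
init_locator_filter_cache(void)
{
	HASHCTL		ctl;

	ctl.keysize = sizeof(RelFileLocator);
	ctl.entrysize = sizeof(LocatorFilterEntry);
	LocatorFilterCache = hash_create("logical replication locator filter cache",
									 128, &ctl, HASH_ELEM | HASH_BLOBS);
}

/*
 * Fast path for every DML record: one hash lookup, no transaction and no
 * syscache access.  Only a cache miss (not shown) has to pay for the
 * historic snapshot, RelidByRelfilenumber() and get_rel_sync_entry() work.
 */
static bool
locator_change_is_published(const RelFileLocator *rlocator,
							ReorderBufferChangeType action)
{
	LocatorFilterEntry *entry;
	bool		found;

	entry = (LocatorFilterEntry *) hash_search(LocatorFilterCache,
											   rlocator, HASH_FIND, &found);
	if (!found || !entry->valid)
		return true;			/* unknown yet: keep the change, resolve later */

	switch (action)
	{
		case REORDER_BUFFER_CHANGE_INSERT:
			return entry->pubactions.pubinsert;
		case REORDER_BUFFER_CHANGE_UPDATE:
			return entry->pubactions.pubupdate;
		case REORDER_BUFFER_CHANGE_DELETE:
			return entry->pubactions.pubdelete;
		default:
			return true;
	}
}
```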