Обсуждение: Reduce useless changes before reassembly during logical replication
Hi hackers, During logical replication, if there is a large write transaction, some spill files will be written to disk, depending on the setting of logical_decoding_work_mem. This behavior can effectively avoid OOM, but if the transaction generates a lot of change before commit, a large number of files may fill the disk. For example, you can update a TB-level table. However, I found an inelegant phenomenon. If the modified large table is not published, its changes will also be written with a large number of spill files. Look at an example below: publisher: ``` create table tbl_pub(id int, val1 text, val2 text,val3 text); create table tbl_t1(id int, val1 text, val2 text,val3 text); CREATE PUBLICATION mypub FOR TABLE public.tbl_pub; ``` subscriber: ``` create table tbl_pub(id int, val1 text, val2 text,val3 text); create table tbl_t1(id int, val1 text, val2 text,val3 text); CREATE SUBSCRIPTION mysub CONNECTION 'host=127.0.0.1 port=5432 user=postgres dbname=postgres' PUBLICATION mypub; ``` publisher: ``` begin; insert into tbl_t1 select i,repeat('xyzzy', i),repeat('abcba', i),repeat('dfds', i) from generate_series(0,999999) i; ``` Later you will see a large number of spill files in the "/$PGDATA/pg_replslot/mysub/" directory. ``` $ll -sh total 4.5G 4.0K -rw------- 1 postgres postgres 200 Nov 30 09:24 state 17M -rw------- 1 postgres postgres 17M Nov 30 08:22 xid-750-lsn-0-10000000.spill 12M -rw------- 1 postgres postgres 12M Nov 30 08:20 xid-750-lsn-0-1000000.spill 17M -rw------- 1 postgres postgres 17M Nov 30 08:23 xid-750-lsn-0-11000000.spill ...... ``` We can see that table tbl_t1 is not published in mypub. It also won't be sent downstream because it's not subscribed. After the transaction is reorganized, the pgoutput decoding plugin filters out changes to these unpublished relationships when sending logical changes. See function pgoutput_change. Most importantly, if we filter out unpublished relationship-related changes after constructing the changes but before queuing the changes into a transaction, will it reduce the workload of logical decoding and avoid disk or memory growth as much as possible? The patch in the attachment is a prototype, which can effectively reduce the memory and disk space usage during logical replication. Design: 1. Added a callback LogicalDecodeFilterByRelCB for the output plugin. 2. Added this callback function pgoutput_table_filter for the pgoutput plugin. Its main implementation is based on the table filter in the pgoutput_change function. Its main function is to determine whether the change needs to be published based on the parameters of the publication, and if not, filter it. 3. After constructing a change and before Queue a change into a transaction, use RelidByRelfilenumber to obtain the relation associated with the change, just like obtaining the relation in the ReorderBufferProcessTXN function. 4. Relation may be a toast, and there is no good way to get its real table relation based on toast relation. Here, I get the real table oid through toast relname, and then get the real table relation. 5. This filtering takes into account INSERT/UPDATE/INSERT. Other changes have not been considered yet and can be expanded in the future. Test: 1. Added a test case 034_table_filter.pl 2. Like the case above, create two tables, the published table tbl_pub and the non-published table tbl_t1 3. Insert 10,000 rows of toast data into tbl_t1 on the publisher, and use pg_ls_replslotdir to record the total size of the slot directory every second. 4. Compare the size of the slot directory at the beginning of the transaction(size1), the size at the end of the transaction (size2), and the average size of the entire process(size3). 5. Assert(size1==size2==size3) Sincerely look forward to your feedback. Regards, lijie
Вложения
On Wed, Jan 17, 2024 at 11:45 AM li jie <ggysxcq@gmail.com> wrote: > > Hi hackers, > > During logical replication, if there is a large write transaction, some > spill files will be written to disk, depending on the setting of > logical_decoding_work_mem. > > This behavior can effectively avoid OOM, but if the transaction > generates a lot of change before commit, a large number of files may > fill the disk. For example, you can update a TB-level table. > > However, I found an inelegant phenomenon. If the modified large table is not > published, its changes will also be written with a large number of spill files. > Look at an example below: Thanks. I agree that decoding and queuing the changes of unpublished tables' data into reorder buffer is an unnecessary task for walsender. It takes processing efforts (CPU overhead), consumes disk space and uses memory configured via logical_decoding_work_mem for a replication connection inefficiently. > Later you will see a large number of spill files in the > > We can see that table tbl_t1 is not published in mypub. It also won't be sent > downstream because it's not subscribed. > After the transaction is reorganized, the pgoutput decoding plugin filters out > changes to these unpublished relationships when sending logical changes. > See function pgoutput_change. Right. Here's my testing [1]. > Most importantly, if we filter out unpublished relationship-related > changes after constructing the changes but before queuing the changes > into a transaction, will it reduce the workload of logical decoding > and avoid disk > or memory growth as much as possible? Right. It can. > The patch in the attachment is a prototype, which can effectively reduce the > memory and disk space usage during logical replication. > > Design: > 1. Added a callback LogicalDecodeFilterByRelCB for the output plugin. > > 2. Added this callback function pgoutput_table_filter for the pgoutput plugin. > Its main implementation is based on the table filter in the > pgoutput_change function. > Its main function is to determine whether the change needs to be published based > on the parameters of the publication, and if not, filter it. > > 3. After constructing a change and before Queue a change into a transaction, > use RelidByRelfilenumber to obtain the relation associated with the change, > just like obtaining the relation in the ReorderBufferProcessTXN function. > > 4. Relation may be a toast, and there is no good way to get its real > table relation based on toast relation. Here, I get the real table oid > through toast relname, and then get the real table relation. > > 5. This filtering takes into account INSERT/UPDATE/INSERT. Other > changes have not been considered yet and can be expanded in the future. Design of this patch is based on the principle of logical decoding filtering things out early on and looks very similar to filter_prepare_cb_wrapper/pg_decode_filter_prepare and filter_by_origin_cb/pgoutput_origin_filter. Per my understanding this design looks okay unless I'm missing anything. > Test: > 1. Added a test case 034_table_filter.pl > 2. Like the case above, create two tables, the published table tbl_pub and > the non-published table tbl_t1 > 3. Insert 10,000 rows of toast data into tbl_t1 on the publisher, and use > pg_ls_replslotdir to record the total size of the slot directory > every second. > 4. Compare the size of the slot directory at the beginning of the > transaction(size1), > the size at the end of the transaction (size2), and the average > size of the entire process(size3). > 5. Assert(size1==size2==size3) I bet that the above test with 10K rows is going to take a noticeable time on some buildfarm members (it took 6 seconds on my dev system which is an AWS EC2 instance). And, the above test can get flaky. Therefore, IMO, the concrete way of testing this feature is by looking at the server logs for the following message using PostgreSQL::Test::Cluster log_contains(). +filter_done: + + if (result && RelationIsValid(relation)) + elog(DEBUG1, "logical filter change by table %s", RelationGetRelationName(relation)); + Here are some comments on the v1 patch: 1. @@ -1415,9 +1419,6 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, TupleTableSlot *old_slot = NULL; TupleTableSlot *new_slot = NULL; - if (!is_publishable_relation(relation)) - return; - Instead of removing is_publishable_relation from pgoutput_change, I think it can just be turned into an assertion Assert(is_publishable_relation(relation));, no? 2. + switch (change->action) + { + /* intentionally fall through */ Perhaps, it must use /* FALLTHROUGH */ just like elsewhere in the code, otherwise a warning is thrown. 3. From commit message: Most of the code in the FilterByTable function is transplanted from the ReorderBufferProcessTXN function, which can be called before the ReorderBufferQueueChange function.It is I think the above note can just be above the FilterByTable function for better understanding. +static bool +FilterByTable(LogicalDecodingContext *ctx, ReorderBufferChange *change) +{ 4. Why is FilterByTable(ctx, change) call placed after DecodeXLogTuple in DecodeInsert, DecodeUpdate and DecodeDelete? Is there a use for decoded tuples done by DecodeXLogTuple in the new callback filter_by_table_cb? If not, can we move FilterByTable call before DecodeXLogTuple to avoid some more extra processing? 5. Why is ReorderBufferChange needed as a parameter to FilterByTable and filter_by_table_cb? Can't just the LogicalDecodingContext and relation name, the change action be enough to decide if the table is publishable or not? If done this way, it can avoid some more processing, no? 6. Please run pgindent and pgperltidy on the new source code and new TAP test file respectively. [1] HEAD: postgres=# BEGIN; BEGIN Time: 0.110 ms postgres=*# insert into tbl_t1 select i,repeat('xyzzy', i),repeat('abcba', i),repeat('dfds', i) from generate_series(0,99999) i; INSERT 0 100000 Time: 379488.265 ms (06:19.488) postgres=*# ubuntu:~/postgres/pg17/bin$ du -sh /home/ubuntu/postgres/pg17/bin/db17/pg_replslot/mysub 837M /home/ubuntu/postgres/pg17/bin/db17/pg_replslot/mysub ubuntu:~/postgres/pg17/bin$ du -sh /home/ubuntu/postgres/pg17/bin/db17 2.6G /home/ubuntu/postgres/pg17/bin/db17 PATCHED: postgres=# BEGIN; BEGIN Time: 0.105 ms postgres=*# insert into tbl_t1 select i,repeat('xyzzy', i),repeat('abcba', i),repeat('dfds', i) from generate_series(0,99999) i; INSERT 0 100000 Time: 380044.554 ms (06:20.045) ubuntu:~/postgres$ du -sh /home/ubuntu/postgres/pg17/bin/db17/pg_replslot/mysub 8.0K /home/ubuntu/postgres/pg17/bin/db17/pg_replslot/mysub ubuntu:~/postgres$ du -sh /home/ubuntu/postgres/pg17/bin/db17 1.8G /home/ubuntu/postgres/pg17/bin/db17 -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Jan 18, 2024 at 12:12 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Wed, Jan 17, 2024 at 11:45 AM li jie <ggysxcq@gmail.com> wrote: > > > > Hi hackers, > > > > During logical replication, if there is a large write transaction, some > > spill files will be written to disk, depending on the setting of > > logical_decoding_work_mem. > > > > This behavior can effectively avoid OOM, but if the transaction > > generates a lot of change before commit, a large number of files may > > fill the disk. For example, you can update a TB-level table. > > > > However, I found an inelegant phenomenon. If the modified large table is not > > published, its changes will also be written with a large number of spill files. > > Look at an example below: > > Thanks. I agree that decoding and queuing the changes of unpublished > tables' data into reorder buffer is an unnecessary task for walsender. > It takes processing efforts (CPU overhead), consumes disk space and > uses memory configured via logical_decoding_work_mem for a replication > connection inefficiently. > This is all true but note that in successful cases (where the table is published) all the work done by FilterByTable(accessing caches, transaction-related stuff) can add noticeable overhead as anyway we do that later in pgoutput_change(). I think I gave the same comment earlier as well but didn't see any satisfactory answer or performance data for successful cases to back this proposal. Note, users can configure to stream_in_progress transactions in which case they shouldn't see such a big problem. However, I agree that if we can find some solution where there is no noticeable overhead then that would be worth considering. -- With Regards, Amit Kapila.
On Thu, Jan 18, 2024 at 2:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jan 18, 2024 at 12:12 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > On Wed, Jan 17, 2024 at 11:45 AM li jie <ggysxcq@gmail.com> wrote: > > > > > > Hi hackers, > > > > > > During logical replication, if there is a large write transaction, some > > > spill files will be written to disk, depending on the setting of > > > logical_decoding_work_mem. > > > > > > This behavior can effectively avoid OOM, but if the transaction > > > generates a lot of change before commit, a large number of files may > > > fill the disk. For example, you can update a TB-level table. > > > > > > However, I found an inelegant phenomenon. If the modified large table is not > > > published, its changes will also be written with a large number of spill files. > > > Look at an example below: > > > > Thanks. I agree that decoding and queuing the changes of unpublished > > tables' data into reorder buffer is an unnecessary task for walsender. > > It takes processing efforts (CPU overhead), consumes disk space and > > uses memory configured via logical_decoding_work_mem for a replication > > connection inefficiently. > > > > This is all true but note that in successful cases (where the table is > published) all the work done by FilterByTable(accessing caches, > transaction-related stuff) can add noticeable overhead as anyway we do > that later in pgoutput_change(). Right. Overhead for published tables need to be studied. A possible way is to mark the checks performed in FilterByTable/filter_by_table_cb and skip the same checks in pgoutput_change. I'm not sure if this works without any issues though. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, > > This is all true but note that in successful cases (where the table is > published) all the work done by FilterByTable(accessing caches, > transaction-related stuff) can add noticeable overhead as anyway we do > that later in pgoutput_change(). I think I gave the same comment > earlier as well but didn't see any satisfactory answer or performance > data for successful cases to back this proposal. I did some benchmark yesterday at [1] and found it adds 20% cpu time. then come out a basic idea, I think it deserves a share. "transaction related stuff" comes from the syscache/systable access except the HistorySansphot. and the syscache is required in the following sistuations: 1. relfilenode (from wal) -> relid. 2. relid -> namespaceid (to check if the relid is a toast relation). 3. if toast, get its origianl relid. 4. access the data from pg_publication_tables. 5. see if the relid is a partition, if yes, we may get its root relation. Acutally we already has a RelationSyncCache for #4, and it *only* need to access syscache when replicate_valid is false, I think this case should be rare, but the caller doesn't know it, so the caller must prepare the transaction stuff in advance even in the most case they are not used. So I think we can get a optimization here. then the attached patch is made. Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com> Date: Wed Feb 21 18:40:03 2024 +0800 Make get_rel_sync_entry less depending on transaction state. get_rel_sync_entry needs transaction only a replicate_valid = false entry is found, this should be some rare case. However the caller can't know if a entry is valid, so they have to prepare the transaction state before calling this function. Such preparation is expensive. This patch makes the get_rel_sync_entry can manage a transaction stage only if necessary. so the callers don't need to prepare it blindly. Then comes to #1, acutally we have RelfilenumberMapHash as a cache, when the cache is hit (suppose this is a usual case), no transaction stuff related. I have two ideas then: 1. Optimize the cache hit sistuation like what we just did for get_rel_sync_entry for the all the 5 kinds of data and only pay the effort for cache miss case. for the data for #2, #3, #5, all the keys are relid, so I think a same HTAB should be OK. 2. add the content for #1, #2, #3, #5 to wal when wal_level is set to logical. In either case, the changes for get_rel_sync_entry should be needed. > Note, users can > configure to stream_in_progress transactions in which case they > shouldn't see such a big problem. People would see the changes is spilled to disk, but the CPU cost for Reorder should be still paid I think. [1] https://www.postgresql.org/message-id/87o7cadqj3.fsf%40163.com -- Best Regards Andy Fan
Вложения
Hi, Sorry I replied too late. > This is all true but note that in successful cases (where the table is > published) all the work done by FilterByTable(accessing caches, > transaction-related stuff) can add noticeable overhead as anyway we do > that later in pgoutput_change(). You are correct. Frequent opening of transactions and access to cache will cause a lot of overhead, which Andy has tested and proved. The root cause is because every dml wal record needs to do this, which is really wasteful. I use a hash table LocatorFilterCache to solve this problem. After getting a RelFileLocator, I go to the hash table to check its PublicationActions and filter it based on the PublicationActions to determine whether it has been published. The effect of my test is very obvious: (perf record) v1: Children Self Command Shared O Symbol + 22.04% 1.53% postgres postgres [.] FilterByTable v2: Children Self Command Shared O Symbol + 0.58% 0.00% postgres postgres [.] ReorderBufferFilterByLocator v1 patch introduces 20% overhead, while v2 only has 0.58%. > Note, users can >configure to stream_in_progress transactions in which case they > shouldn't see such a big problem. Yes, stream mode can prevent these irrelevant changes from being written to disk or sent to downstream. However, CPU and memory consumption will also be incurred when processing these useless changes. Here is my simple test[1]: base on master : CPU stat: perf stat -p pid -e cycles -I 1000 # time counts unit events 76.007070936 9,691,035 cycles 77.007163484 5,977,694 cycles 78.007252533 5,924,703 cycles 79.007346862 5,861,934 cycles 80.007438070 5,858,264 cycles 81.007527122 6,408,759 cycles 82.007615711 6,397,988 cycles 83.007705685 5,520,407 cycles 84.007794387 5,359,162 cycles 85.007884879 5,194,079 cycles 86.007979797 5,391,270 cycles 87.008069606 5,474,536 cycles 88.008162827 5,594,190 cycles 89.008256327 5,610,023 cycles 90.008349583 5,627,350 cycles 91.008437785 6,273,510 cycles 92.008527938 580,934,205 cycles 93.008620136 4,404,672 cycles 94.008711818 4,599,074 cycles 95.008805591 4,374,958 cycles 96.008894543 4,300,180 cycles 97.008987582 4,157,892 cycles 98.009077445 4,072,178 cycles 99.009163475 4,043,875 cycles 100.009254888 5,382,667 cycles memory stat: pistat -p pid -r 1 10 07:57:18 AM UID PID minflt/s majflt/s VSZ RSS %MEM Command 07:57:19 AM 1000 11848 233.00 0.00 386872 81276 0.01 postgres 07:57:20 AM 1000 11848 235.00 0.00 387008 82068 0.01 postgres 07:57:21 AM 1000 11848 236.00 0.00 387144 83124 0.01 postgres 07:57:22 AM 1000 11848 236.00 0.00 387144 83916 0.01 postgres 07:57:23 AM 1000 11848 236.00 0.00 387280 84972 0.01 postgres 07:57:24 AM 1000 11848 334.00 0.00 337000 36928 0.00 postgres 07:57:25 AM 1000 11848 3.00 0.00 337000 36928 0.00 postgres 07:57:26 AM 1000 11848 0.00 0.00 337000 36928 0.00 postgres 07:57:27 AM 1000 11848 0.00 0.00 337000 36928 0.00 postgres 07:57:28 AM 1000 11848 0.00 0.00 337000 36928 0.00 postgres Average: 1000 11848 151.30 0.00 362045 60000 0.01 postgres After patched: # time counts unit events 76.007623310 4,237,505 cycles 77.007717436 3,989,618 cycles 78.007813848 3,965,857 cycles 79.007906412 3,601,715 cycles 80.007998111 3,670,835 cycles 81.008092670 3,495,844 cycles 82.008187456 3,822,695 cycles 83.008281335 5,034,146 cycles 84.008374998 3,867,683 cycles 85.008470245 3,996,927 cycles 86.008563783 3,823,893 cycles 87.008658628 3,825,472 cycles 88.008755246 3,823,079 cycles 89.008849719 3,966,083 cycles 90.008945774 4,012,704 cycles 91.009044492 4,026,860 cycles 92.009139621 3,860,912 cycles 93.009242485 3,961,533 cycles 94.009346304 3,799,897 cycles 95.009440164 3,959,602 cycles 96.009534251 3,960,405 cycles 97.009625904 3,762,581 cycles 98.009716518 4,859,490 cycles 99.009807720 3,940,845 cycles 100.009901399 3,888,095 cycles 08:01:47 AM UID PID minflt/s majflt/s VSZ RSS %MEM Command 08:01:48 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres 08:01:49 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres 08:01:50 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres 08:01:51 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres 08:01:52 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres 08:01:53 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres 08:01:54 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres 08:01:55 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres 08:01:56 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres 08:01:57 AM 1000 19466 0.00 0.00 324424 15140 0.00 postgres Average: 1000 19466 0.00 0.00 324424 15140 0.00 postgres Through comparison, it is found that patch is also profitable for stream mode. Of course, LocatorFilterCache also need to deal with invalidation, such as the corresponding relation invalidate, or pg_publication changes, just like RelationSyncCache and RelfilenumberMapHash. But ddl is a small amount after all, which is insignificant compared to a large amount of dml. Another problem is that the LocatorFilterCache looks redundant compared to RelationSyncCache and RelfilenumberMapHash. like this: 1. RelfilenumberMapHash: relfilenode -> relation oid 2. RelationSyncCache: relation oid-> PublicationActions 3. LocatorFilterCache: RelFileLocator-> PublicationActions The reason is that you cannot simply access two caches from the relfilenode --> PublicationActions, and you must use historical snapshots to access transactions and relcache in the middle, so there is no good solution for this for the time being, ugly but effective. >Therefore, IMO, the concrete way of testing this feature is by looking >at the server logs for the following message using >PostgreSQL::Test::Cluster log_contains(). thinks, done. >Instead of removing is_publishable_relation from pgoutput_change, I >think it can just be turned into an assertion >Assert(is_publishable_relation(relation));, no? yes, done. >Perhaps, it must use /* FALLTHROUGH */ just like elsewhere in the >code, otherwise a warning is thrown. /* intentionally fall through */ can also avoid warnings. >Can't just the LogicalDecodingContext and >relation name, the change action be enough to decide if the table is >publishable or not? If done this way, it can avoid some more >processing, no? yes, RelFileLocator filtering is used directly in v2, and change is no longer required. >Please run pgindent and pgperltidy on the new source code and new >TAP test file respectively. ok. [1]: https://www.postgresql.org/message-id/CAGfChW62f5NTNbLsqO-6_CrmKPqBEQtWPcPDafu8pCwZznk%3Dxw%40mail.gmail.com