Discussion: postgres_fdw: Use COPY to speed up batch inserts
Hi all,

Currently, postgres_fdw uses prepared statements to insert batches into
foreign tables. Although this works fine for most use cases, the COPY
command can be used in some scenarios to speed up large batch inserts.

The attached patch implements this idea of using the COPY command for
batch inserts on postgres_fdw foreign tables. I've performed some
benchmarks using pgbench and the results seem good enough to consider
this.

I've run the benchmark with different batch_size values to see when
this optimization could be useful. The following results are the best
tps of 3 runs.

Command: pgbench -n -c 10 -j 10 -t 100 -f bench.sql postgres

batch_size: 10
master tps: 76.360406
patch tps: 68.917109

batch_size: 100
master tps: 123.427731
patch tps: 243.737055

batch_size: 1000
master tps: 132.500506
patch tps: 239.295132

It seems that with a batch_size of 100 or more we can get a
considerable speedup for batch inserts. The attached patch uses the
COPY command whenever we have *numSlots > 1, but the tests show that
maybe we should have a GUC to enable this?

I also think that we can have a better patch by removing the duplicated
code introduced in this first version, especially in the cleanup phase,
but I tried to keep things simple in this initial phase to make the
review easier and also just to test the idea.

Lastly, I don't know if we should change the EXPLAIN (ANALYZE, VERBOSE)
output for batch inserts that use COPY to mention that we are sending
the COPY command to the remote server. I guess so?

(This proposal is based on a patch idea written by Tomas Vondra in one
of his blog posts.)

--
Matheus Alcantara
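As a rough illustration of the approach (a sketch under assumptions,
not the patch itself: build_copy_sql and its signature are
hypothetical, while deparseRelation() from postgres_fdw's deparse.c and
quote_identifier() from the server's ruleutils.c are existing
functions), deparsing the COPY statement could look like this:

#include "postgres.h"
#include "lib/stringinfo.h"
#include "nodes/pg_list.h"
#include "utils/rel.h"
#include "utils/ruleutils.h"	/* quote_identifier() */

/*
 * Hypothetical helper: build "COPY tab (col, ...) FROM STDIN" for the
 * attributes being inserted. The v1 patch also appends
 * "(FORMAT TEXT, DELIMITER ',')"; this sketch uses the default text
 * format instead.
 */
static char *
build_copy_sql(Relation rel, List *target_attrs, TupleDesc tupdesc)
{
	StringInfoData buf;
	bool		first = true;

	initStringInfo(&buf);
	appendStringInfoString(&buf, "COPY ");
	deparseRelation(&buf, rel);		/* from postgres_fdw's deparse.c */
	appendStringInfoString(&buf, " (");

	foreach_int(attnum, target_attrs)
	{
		Form_pg_attribute attr = TupleDescAttr(tupdesc, attnum - 1);

		/* Generated columns are not transmitted */
		if (attr->attgenerated)
			continue;

		if (!first)
			appendStringInfoString(&buf, ", ");
		first = false;
		appendStringInfoString(&buf, quote_identifier(NameStr(attr->attname)));
	}
	appendStringInfoString(&buf, ") FROM STDIN");

	return buf.data;
}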
Attachments
Hi Matheus,

Thanks for the patch. Please add it to the next commitfest (PG19-3) at

https://commitfest.postgresql.org/

so that we don't lose track of the patch.

On 10/15/25 17:02, Matheus Alcantara wrote:
> Hi all,
>
> Currently, postgres_fdw uses prepared statements to insert batches
> into foreign tables. Although this works fine for most use cases, the
> COPY command can be used in some scenarios to speed up large batch
> inserts.
>

Makes sense.

> The attached patch implements this idea of using the COPY command for
> batch inserts on postgres_fdw foreign tables. I've performed some
> benchmarks using pgbench and the results seem good enough to consider
> this.
>

Thanks. The code looks sensible in general, I think. I have a couple of
minor comments below. It'd be good to also update the documentation,
and add some tests to postgres_fdw.sql, to exercise this new code.

The sgml docs (doc/src/sgml/postgres-fdw.sgml) mention batch_size, and
explain how it's related to the number of parameters in the INSERT
statement. That does not matter when using COPY under the hood, so this
should be amended/clarified in some way. It doesn't need to be
super-detailed, though.

A couple of minor comments about the code:

1) buildCopySql

- Does this really need the "(FORMAT TEXT, DELIMITER ',')" part? AFAIK
no, if we use the default copy format in convert_slot_to_copy_text.

- Shouldn't we cache the COPY SQL, similarly to how we keep the INSERT
SQL, instead of rebuilding it over and over for every batch?

2) convert_slot_to_copy_text

- It's probably better to not hard-code the delimiters etc.

- I wonder if the formatting needs to do something more like what
copyto.c does (through CopyToTextOneRow and CopyAttributeOutText).
Maybe not, not sure. (A sketch of those escaping rules follows at the
end of this message.)

3) execute_foreign_modify

- I think the new block of code is a bit confusing. It essentially does
something similar to the original code, but not entirely. I suggest we
move it to a new function, and call that from execute_foreign_modify.

- I agree it's probably better to do COPYBUFSIZ, or something like
that, to send the data in smaller (but not tiny) chunks.

> I've run the benchmark with different batch_size values to see when
> this optimization could be useful. The following results are the best
> tps of 3 runs.
>
> Command: pgbench -n -c 10 -j 10 -t 100 -f bench.sql postgres
>
> batch_size: 10
> master tps: 76.360406
> patch tps: 68.917109
>
> batch_size: 100
> master tps: 123.427731
> patch tps: 243.737055
>
> batch_size: 1000
> master tps: 132.500506
> patch tps: 239.295132
>
> It seems that with a batch_size of 100 or more we can get a
> considerable speedup for batch inserts.
>

I did a bunch of benchmarks too, and I see similar speedups (depending
on the batch and data size). Attached is the script I used to run this.
It runs COPY into a foreign table, pointing to a second instance on the
same machine. And it does that for a range of data sizes, batch sizes,
client counts, and logged/unlogged tables.

The attached PDF summarizes the results, comparing the "copy" build
(with this patch) to "master". There are also two "resourceowner"
builds, with the patch at [1] - that helps COPY in general a lot, and
it improves the case with batch_size=1000.

I think the results are good, at least with larger batch sizes. The
regressions with batch_size=10 on an UNLOGGED table are not great,
though. I have results from another machine, and there it affects even
the LOGGED table. I'm not sure if this is inherent, or something the
patch can fix.
In a way, it's not surprising - batching with tiny batches adds the
overhead without much benefit. I don't think that kills the patch.

> The attached patch uses the COPY command whenever we have *numSlots > 1,
> but the tests show that maybe we should have a GUC to enable this?
>

I can imagine having a GUC for testing, but it's not strictly necessary.

> I also think that we can have a better patch by removing the duplicated
> code introduced in this first version, especially in the cleanup phase,
> but I tried to keep things simple in this initial phase to make the
> review easier and also just to test the idea.
>
> Lastly, I don't know if we should change the EXPLAIN (ANALYZE, VERBOSE)
> output for batch inserts that use COPY to mention that we are sending
> the COPY command to the remote server. I guess so?
>

Good point. We definitely should not show the SQL for an INSERT when
we're actually running a COPY.

regards

[1] https://www.postgresql.org/message-id/84f20db7-7d57-4dc0-8144-7e38e0bbe75d%40vondra.me

--
Tomas Vondra
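For reference, a simplified sketch of the escaping that the default
COPY TEXT format requires (roughly what CopyAttributeOutText in
copyto.c does; the helper name is hypothetical, and the real function
also handles a configurable delimiter and encoding details):

#include "postgres.h"
#include "lib/stringinfo.h"

/*
 * Simplified sketch of default COPY TEXT escaping: backslash and the
 * control characters that are special in the text format get
 * backslash-escaped; NULL values are emitted as \N by the caller.
 */
static void
copy_text_escape(StringInfo buf, const char *val)
{
	for (const char *p = val; *p; p++)
	{
		switch (*p)
		{
			case '\b': appendStringInfoString(buf, "\\b"); break;
			case '\f': appendStringInfoString(buf, "\\f"); break;
			case '\n': appendStringInfoString(buf, "\\n"); break;
			case '\r': appendStringInfoString(buf, "\\r"); break;
			case '\t': appendStringInfoString(buf, "\\t"); break;
			case '\v': appendStringInfoString(buf, "\\v"); break;
			case '\\': appendStringInfoString(buf, "\\\\"); break;
			default:   appendStringInfoChar(buf, *p); break;
		}
	}
}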
Attachments
On Thu, Oct 16, 2025 at 10:42 PM Tomas Vondra <tomas@vondra.me> wrote:
> Thanks for the patch. Please add it to the next commitfest (PG19-3) at
Hi Matheus! Same here - thanks for the patch!
> > The attached patch uses the COPY command whenever we have *numSlots > 1,
> > but the tests show that maybe we should have a GUC to enable this?
> >
>
> I can imagine having a GUC for testing, but it's not strictly necessary.
Just a note: I've played with this patch for maybe 20 minutes and it
works. However, if we added yet another GUC, we would need to enable
two of them (batch_size and this hypothetical `batch_use_copy`?).
Some other stuff I've tried to cover:
1. How this works with INSERT RETURNING -> as per the patch it falls
back from COPY to INSERT as expected.
2. How this works with INSERT ON CONFLICT -> well, we cannot have
constraints on postgres_fdw foreign tables, so it is impossible.
3. How this works with MERGE -> well, MERGE doesn't work with postgres_fdw.
4. I've found that big rows don't play well with the COPY feature
without a memory limit, so some special handling should probably be
done here. It's a nonsense query, but:
postgres@postgres:1236 : 15836 # INSERT INTO local_t1 (id, t)
SELECT s, repeat(md5(s::text), 10000000) from generate_series(100,
103) s;
2025-10-17 11:17:08.742 CEST [15836] LOG: statement: INSERT INTO
local_t1 (id, t) SELECT s, repeat(md5(s::text), 10000000) from
generate_series(100, 103) s;
2025-10-17 11:17:08.743 CEST [15838] LOG: statement: START
TRANSACTION ISOLATION LEVEL REPEATABLE READ
2025-10-17 11:17:38.302 CEST [15838] LOG: statement: COPY
public.t1(id, t, counter) FROM STDIN (FORMAT TEXT, DELIMITER ',')
ERROR: string buffer exceeds maximum allowed length (1073741823 bytes)
DETAIL: Cannot enlarge string buffer containing 960000028 bytes
by 320000000 more bytes.
2025-10-17 11:17:40.213 CEST [15836] ERROR: string buffer exceeds
maximum allowed length (1073741823 bytes)
2025-10-17 11:17:40.213 CEST [15836] DETAIL: Cannot enlarge
string buffer containing 960000028 bytes by 320000000 more bytes.
2025-10-17 11:17:40.213 CEST [15836] STATEMENT: INSERT INTO
local_t1 (id, t) SELECT s, repeat(md5(s::text), 10000000) from
generate_series(100, 103) s;
but then the backend never finishes (constant loop in PQCleanup() or
somewhere close to that), and the server behaves unstably. Without
batch_size set, the very same INSERT behaves OK.
Regards,
-J.
Thanks for the review!

On Thu Oct 16, 2025 at 5:39 PM -03, Tomas Vondra wrote:
> Thanks for the patch. Please add it to the next commitfest (PG19-3) at
>
> https://commitfest.postgresql.org/
>
> so that we don't lose track of the patch.
>

Here it is: https://commitfest.postgresql.org/patch/6137/

> Thanks. The code looks sensible in general, I think. I have a couple of
> minor comments below. It'd be good to also update the documentation,
> and add some tests to postgres_fdw.sql, to exercise this new code.
>
> The sgml docs (doc/src/sgml/postgres-fdw.sgml) mention batch_size, and
> explain how it's related to the number of parameters in the INSERT
> statement. That does not matter when using COPY under the hood, so this
> should be amended/clarified in some way. It doesn't need to be
> super-detailed, though.
>

I'll work on this and intend to have something in the next version.

> A couple of minor comments about the code:
>
> 1) buildCopySql
>
> - Does this really need the "(FORMAT TEXT, DELIMITER ',')" part? AFAIK
> no, if we use the default copy format in convert_slot_to_copy_text.
>

You're right. I've removed the (FORMAT TEXT, DELIMITER ',') part.

> - Shouldn't we cache the COPY SQL, similarly to how we keep the INSERT
> SQL, instead of rebuilding it over and over for every batch?
>

I tried to reuse the fmstate->query field to cache the COPY SQL, but
running the postgres_fdw.sql regression tests shows that this may not
work. When we are running a user-supplied COPY command on a foreign
table, CopyMultiInsertBufferFlush() calls
ri_FdwRoutine->ExecForeignBatchInsert, which may pass different values
for numSlots based on the number of slots already sent to the foreign
server; eventually it may pass numSlots as 1, which will not use COPY
under the hood to send the data. If we cache the COPY command in
fmstate->query this breaks, because the normal INSERT path in
execute_foreign_modify uses fmstate->query to build the prepared
statement sent to the foreign server. So basically what I'm trying to
say is that when the server is executing a COPY into a foreign table,
it may use either the COPY command or the INSERT command to send the
data to the foreign server. That being said, I decided to create a new
copy_query field on PgFdwModifyState to cache only COPY commands.
Please let me know if my understanding is wrong or if we could have a
better approach here.

> 2) convert_slot_to_copy_text
>
> - It's probably better to not hard-code the delimiters etc.
>

Now that we are using the default copy format, is it safe to hard-code
the default delimiter, which is \t? Or should we still make this
parameterizable, or something else?

> - I wonder if the formatting needs to do something more like what
> copyto.c does (through CopyToTextOneRow and CopyAttributeOutText).
> Maybe not, not sure.
>

I'm still checking this. I think it is not needed, but I want to do
some more tests to make sure.

> 3) execute_foreign_modify
>
> - I think the new block of code is a bit confusing. It essentially does
> something similar to the original code, but not entirely. I suggest we
> move it to a new function, and call that from execute_foreign_modify.
>

Fixed.

> - I agree it's probably better to do COPYBUFSIZ, or something like
> that, to send the data in smaller (but not tiny) chunks.
>

Fixed. I've declared the default chunk size as 8192 (the same value
used in psql/copy.c); let me know if we should use another value or
perhaps make this configurable.
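For reference, the sending side could look roughly like this (a sketch,
not the patch itself: the helper names are hypothetical,
PQputCopyData/PQputCopyEnd are standard libpq calls, and error handling
is simplified):

#include "postgres.h"
#include "libpq-fe.h"

#define COPYBUFSIZ	8192		/* same value psql/copy.c uses */

/* Hypothetical helper: push accumulated COPY text in COPYBUFSIZ chunks. */
static void
send_copy_data(PGconn *conn, const char *data, size_t len)
{
	size_t		offset = 0;

	while (offset < len)
	{
		int			chunk = (int) Min((size_t) COPYBUFSIZ, len - offset);

		if (PQputCopyData(conn, data + offset, chunk) != 1)
			elog(ERROR, "could not send COPY data: %s", PQerrorMessage(conn));
		offset += chunk;
	}
}

/* Hypothetical helper: finish the COPY and consume its result. */
static void
finish_copy(PGconn *conn)
{
	PGresult   *res;

	if (PQputCopyEnd(conn, NULL) != 1)
		elog(ERROR, "could not end COPY: %s", PQerrorMessage(conn));

	while ((res = PQgetResult(conn)) != NULL)
	{
		/* simplified: an ERROR here leaks res */
		if (PQresultStatus(res) != PGRES_COMMAND_OK)
			elog(ERROR, "COPY failed: %s", PQerrorMessage(conn));
		PQclear(res);
	}
}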
>> Lastly, I don't know if we should change the EXPLAIN (ANALYZE, VERBOSE)
>> output for batch inserts that use COPY to mention that we are sending
>> the COPY command to the remote server. I guess so?
>>
>
> Good point. We definitely should not show the SQL for an INSERT when
> we're actually running a COPY.
>

This seems a bit tricky to implement. COPY is used based on the number
of slots in the TupleTableSlot array used for the batch insert. The
numSlots that execute_foreign_modify() receives comes from
ResultRelInfo->ri_NumSlots during ExecInsert(). We don't have this
information during EXPLAIN, which is handled by
postgresExplainForeignModify(); we only have
ResultRelInfo->ri_BatchSize at this stage. The current idea is to use
the COPY command if the number of slots is > 1, so I'm wondering if we
should use another mechanism to enable the COPY usage. For example, we
could use COPY whenever batch_size is configured to a number greater
than X - but what if the INSERT statement is only inserting a single
row? Should we still use the COPY command to ingest a single row into
the foreign table? Any thoughts?

--
Matheus Alcantara
Attachments
Thanks for testing and for the comments!

On Fri Oct 17, 2025 at 6:28 AM -03, Jakub Wartak wrote:
> On Thu, Oct 16, 2025 at 10:42 PM Tomas Vondra <tomas@vondra.me> wrote:
>> Thanks for the patch. Please add it to the next commitfest (PG19-3) at
>
> Hi Matheus! Same here - thanks for the patch!
>
>> > The attached patch uses the COPY command whenever we have *numSlots > 1,
>> > but the tests show that maybe we should have a GUC to enable this?
>> >
>>
>> I can imagine having a GUC for testing, but it's not strictly necessary.
>
> Just a note: I've played with this patch for maybe 20 minutes and it
> works. However, if we added yet another GUC, we would need to enable
> two of them (batch_size and this hypothetical `batch_use_copy`?).
>

I was thinking of a GUC that enables the use of the COPY command if the
number of rows being ingested during the batch insert is greater than
the value configured in that GUC, something like batch_size_for_copy.
But for now I'm starting to think that perhaps we could use COPY if
batch_size is configured to a number > 1 (or make this configurable?).
That would make it easier to show in EXPLAIN that we will send a COPY
command to the remote server instead of an INSERT. The current patch
relies on the number of rows being sent to the foreign server to decide
whether to use COPY, and IIUC we don't have this information during
EXPLAIN. I wrote more about this in my previous reply [1].

> Some other stuff I've tried to cover:
> ...
> 4. I've found that big rows don't play well with the COPY feature
> without a memory limit, so some special handling should probably be
> done here. It's a nonsense query, but:
>
> postgres@postgres:1236 : 15836 # INSERT INTO local_t1 (id, t)
> SELECT s, repeat(md5(s::text), 10000000) from generate_series(100,
> 103) s;
> 2025-10-17 11:17:08.742 CEST [15836] LOG: statement: INSERT INTO
> local_t1 (id, t) SELECT s, repeat(md5(s::text), 10000000) from
> generate_series(100, 103) s;
> 2025-10-17 11:17:08.743 CEST [15838] LOG: statement: START
> TRANSACTION ISOLATION LEVEL REPEATABLE READ
> 2025-10-17 11:17:38.302 CEST [15838] LOG: statement: COPY
> public.t1(id, t, counter) FROM STDIN (FORMAT TEXT, DELIMITER ',')
> ERROR: string buffer exceeds maximum allowed length (1073741823 bytes)
> DETAIL: Cannot enlarge string buffer containing 960000028 bytes
> by 320000000 more bytes.
> 2025-10-17 11:17:40.213 CEST [15836] ERROR: string buffer exceeds
> maximum allowed length (1073741823 bytes)
> 2025-10-17 11:17:40.213 CEST [15836] DETAIL: Cannot enlarge
> string buffer containing 960000028 bytes by 320000000 more bytes.
> 2025-10-17 11:17:40.213 CEST [15836] STATEMENT: INSERT INTO
> local_t1 (id, t) SELECT s, repeat(md5(s::text), 10000000) from
> generate_series(100, 103) s;
>
> but then the backend never finishes (constant loop in PQCleanup() or
> somewhere close to that), and the server behaves unstably. Without
> batch_size set, the very same INSERT behaves OK.
>

In the latest version that I sent in [1] I introduced a buffer size
limit, and testing this INSERT statement with that version seems to fix
the issue. Could you please check this too?

[1] https://www.postgresql.org/message-id/CAFY6G8ePwjT8GiJX1AK5FDMhfq-sOnny6optgTPg98HQw7oJ0g%40mail.gmail.com

--
Matheus Alcantara
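The buffer limit mentioned above could look roughly like this (a sketch
under assumptions: copy_data is the StringInfo accumulating row text as
discussed upthread, while COPY_BUFFER_LIMIT, append_copy_row and the
send_copy_data helper sketched earlier are hypothetical names):

#define COPY_BUFFER_LIMIT	(8 * 1024 * 1024)	/* assumed 8MB cap */

/*
 * Sketch: append one row's COPY text and flush to the server once the
 * buffer grows past the limit, so a batch of very wide rows cannot hit
 * the 1GB StringInfo ceiling.
 */
static void
append_copy_row(PgFdwModifyState *fmstate, PGconn *conn, const char *rowtext)
{
	appendStringInfoString(&fmstate->copy_data, rowtext);

	if (fmstate->copy_data.len >= COPY_BUFFER_LIMIT)
	{
		send_copy_data(conn, fmstate->copy_data.data, fmstate->copy_data.len);
		resetStringInfo(&fmstate->copy_data);
	}
}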
On Tue Oct 21, 2025 at 11:25 AM -03, Matheus Alcantara wrote:
>>> Lastly, I don't know if we should change the EXPLAIN (ANALYZE, VERBOSE)
>>> output for batch inserts that use COPY to mention that we are sending
>>> the COPY command to the remote server. I guess so?
>>>
>>
>> Good point. We definitely should not show the SQL for an INSERT when
>> we're actually running a COPY.
>>
> This seems a bit tricky to implement. COPY is used based on the number
> of slots in the TupleTableSlot array used for the batch insert. The
> numSlots that execute_foreign_modify() receives comes from
> ResultRelInfo->ri_NumSlots during ExecInsert(). We don't have this
> information during EXPLAIN, which is handled by
> postgresExplainForeignModify(); we only have
> ResultRelInfo->ri_BatchSize at this stage. The current idea is to use
> the COPY command if the number of slots is > 1, so I'm wondering if we
> should use another mechanism to enable the COPY usage. For example, we
> could use COPY whenever batch_size is configured to a number greater
> than X - but what if the INSERT statement is only inserting a single
> row? Should we still use the COPY command to ingest a single row into
> the foreign table? Any thoughts?
>

Thinking more about this, I realized that when we are deparsing the
remote SQL to be sent to the foreign server at the planning phase (via
postgresPlanForeignModify()) we don't have the batch_size and
number-of-rows information, so currently we cannot know at plan time
whether COPY will be used for a batch insert, because IIUC this
information is only available at query runtime. One way to make it
possible is that we could simply use PgFdwModifyState->copy_data during
postgresExplainForeignModify() if it's not null. Since we will only
have this information during query execution, the drawback of this
approach is that we would only show the COPY as the Remote SQL during
EXPLAIN (ANALYZE). Please see the attached v3 version that implements
this idea.

> I tried to reuse the fmstate->query field to cache the COPY SQL, but
> running the postgres_fdw.sql regression tests shows that this may not
> work. When we are running a user-supplied COPY command on a foreign
> table, CopyMultiInsertBufferFlush() calls
> ri_FdwRoutine->ExecForeignBatchInsert, which may pass different values
> for numSlots based on the number of slots already sent to the foreign
> server; eventually it may pass numSlots as 1, which will not use COPY
> under the hood to send the data. If we cache the COPY command in
> fmstate->query this breaks, because the normal INSERT path in
> execute_foreign_modify uses fmstate->query to build the prepared
> statement sent to the foreign server. So basically what I'm trying to
> say is that when the server is executing a COPY into a foreign table,
> it may use either the COPY command or the INSERT command to send the
> data to the foreign server. That being said, I decided to create a new
> copy_query field on PgFdwModifyState to cache only COPY commands.
> Please let me know if my understanding is wrong or if we could have a
> better approach here.
>

Based on the information that I mentioned above, I think that we need
some way to not mix INSERT and COPY commands when executing a
user-supplied COPY on a foreign table. Alternatively, we could disable
the COPY under the hood and always fall back to INSERT, or enable COPY
even when *numSlots is 1, so that in the EXPLAIN (ANALYZE) output we
can show the Remote SQL correctly. Does that make sense?
I'm still not sure whether the trigger for using the COPY command for
batch inserts should be *numSlots > 1 or something else. I'm open to
better ideas. Thoughts?

--
Matheus Alcantara
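The EXPLAIN (ANALYZE) idea could look roughly like this inside
postgresExplainForeignModify() (a sketch: copy_query is the field
proposed upthread for caching the COPY statement, while
FdwModifyPrivateUpdateSql and ExplainPropertyText are how postgres_fdw
already reports the deparsed INSERT):

/* Inside postgresExplainForeignModify() - a sketch, not the patch. */
if (es->verbose)
{
	PgFdwModifyState *fmstate = (PgFdwModifyState *) rinfo->ri_FdwState;

	/* Only set during EXPLAIN (ANALYZE), when the COPY path actually ran */
	if (fmstate && fmstate->copy_query)
		ExplainPropertyText("Remote SQL", fmstate->copy_query, es);
	else
	{
		char	   *sql = strVal(list_nth(fdw_private,
										  FdwModifyPrivateUpdateSql));

		ExplainPropertyText("Remote SQL", sql, es);
	}
}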
Attachments
On Thu, Oct 23, 2025 at 8:01 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:
>
> Please see the attached v3 version that implements this idea.
>

Hi. I am not familiar with this module.

Some of the foreach calls can be replaced with foreach_int.

I suspect that somewhere Form_pg_attribute.attisdropped is not handled
properly. The following setup will crash:

---source database
drop table batch_table1;
create table batch_table1(x int);

---foreign table database
drop foreign table if exists ftable1;
CREATE FOREIGN TABLE ftable1 ( x int ) SERVER loopback1 OPTIONS (
table_name 'batch_table1', batch_size '10' );
ALTER FOREIGN TABLE ftable1 DROP COLUMN x;
ALTER FOREIGN TABLE ftable1 add COLUMN x int;

INSERT INTO ftable1 SELECT * FROM generate_series(1, 10) i; --- this
will cause server crash.
Hi, thanks for testing this patch!
On Thu Oct 23, 2025 at 6:49 AM -03, jian he wrote:
> On Thu, Oct 23, 2025 at 8:01 AM Matheus Alcantara
> <matheusssilv97@gmail.com> wrote:
>>
>> Please see the attached v3 version that implements this idea.
>>
> Hi. I am not familiar with this module.
>
> Some of the foreach calls can be replaced with foreach_int.
>
Fixed.
> I suspect that somewhere Form_pg_attribute.attisdropped is not handled properly.
> The following setup will crash:
>
> ---source database
> drop table batch_table1;
> create table batch_table1(x int);
>
> ---foreign table database
> drop foreign table if exists ftable1;
> CREATE FOREIGN TABLE ftable1 ( x int ) SERVER loopback1 OPTIONS (
> table_name 'batch_table1', batch_size '10' );
> ALTER FOREIGN TABLE ftable1 DROP COLUMN x;
> ALTER FOREIGN TABLE ftable1 add COLUMN x int;
>
> INSERT INTO ftable1 SELECT * FROM generate_series(1, 10) i; --- this
> will cause server crash.
>
I've tested this scenario, and the attisdropped field is still being
set to false. After some debugging I realized that the problem was how
I was accessing the fmstate->p_flinfo array - I was using attnum - 1 as
the index, but I don't think that's correct.
In create_foreign_modify() we have the following code:
fmstate->p_flinfo = (FmgrInfo *) palloc0(sizeof(FmgrInfo) * n_params);
fmstate->p_nums = 0;

if (operation == CMD_INSERT || operation == CMD_UPDATE)
{
    /* Set up for remaining transmittable parameters */
    foreach(lc, fmstate->target_attrs)
    {
        int         attnum = lfirst_int(lc);
        Form_pg_attribute attr = TupleDescAttr(tupdesc, attnum - 1);

        Assert(!attr->attisdropped);

        /* Ignore generated columns; they are set to DEFAULT */
        if (attr->attgenerated)
            continue;
        getTypeOutputInfo(attr->atttypid, &typefnoid, &isvarlena);
        fmgr_info(typefnoid, &fmstate->p_flinfo[fmstate->p_nums]);
        fmstate->p_nums++;
    }
}
So I think that I should index the fmstate->p_flinfo array, when
looping through target_attrs, with an int counter starting at 0 and
incremented after each iteration - the same way the array is filled
above. Although I'm not sure my understanding is fully correct, I've
implemented this in the attached patch and it seems to fix the error.
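For illustration, the corrected lookup in the conversion routine could
look roughly like this (a sketch: buf, slot, tupdesc and fmstate are
assumed from the surrounding function, and escaping of special
characters is omitted; slot_getattr() and OutputFunctionCall() are
existing server functions):

/*
 * Sketch of the fix described above: p_flinfo is filled sequentially
 * in create_foreign_modify(), so index it with a counter that advances
 * the same way, not with attnum - 1 (which goes wrong once a dropped
 * column shifts the attribute numbers).
 */
int			pindex = 0;
bool		first = true;

foreach_int(attnum, fmstate->target_attrs)
{
	Form_pg_attribute attr = TupleDescAttr(tupdesc, attnum - 1);
	bool		isnull;
	Datum		value;

	/* Generated columns get no p_flinfo entry in create_foreign_modify() */
	if (attr->attgenerated)
		continue;

	if (!first)
		appendStringInfoChar(&buf, '\t');
	first = false;

	value = slot_getattr(slot, attnum, &isnull);
	if (isnull)
		appendStringInfoString(&buf, "\\N");
	else
		appendStringInfoString(&buf,
							   OutputFunctionCall(&fmstate->p_flinfo[pindex],
												  value));
	pindex++;
}
appendStringInfoChar(&buf, '\n');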
In this new version I also added some regression tests to
postgres_fdw.sql.
--
Matheus Alcantara