Discussion: postgres_fdw: Use COPY to speed up batch inserts
Hi all,

Currently, postgres_fdw uses prepared statements to insert batches into
foreign tables. Although this works fine for most use cases, the COPY
command can be used in some scenarios to speed up large batch inserts.

The attached patch implements this idea of using the COPY command for
batch inserts on postgres_fdw foreign tables. I've performed some
benchmarks using pgbench and the results seem good enough to consider
this.

I've run the benchmark with different batch_size values to see when
this optimization could be useful. The following results are the best
tps of 3 runs.

Command: pgbench -n -c 10 -j 10 -t 100 -f bench.sql postgres

batch_size: 10
master tps: 76.360406
patch tps: 68.917109

batch_size: 100
master tps: 123.427731
patch tps: 243.737055

batch_size: 1000
master tps: 132.500506
patch tps: 239.295132

It seems that with a batch_size of 100 or more we can get a
considerable speedup for batch inserts. The attached patch uses the
COPY command whenever we have *numSlots > 1, but the tests show that
maybe we should have a GUC to enable this?

I also think that we can have a better patch by removing the duplicated
code introduced in this first version, especially in the cleanup phase,
but I tried to keep things simple in this initial phase to make the
review easier and also just to test the idea.

Lastly, I don't know if we should change the EXPLAIN (ANALYZE, VERBOSE)
output for batch inserts that use COPY to mention that we are sending
the COPY command to the remote server. I guess so?

(This proposal is based on a patch idea written by Tomas Vondra in one
of his blog posts.)

--
Matheus Alcantara
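As a rough illustration of the approach (a sketch under assumptions,
not the patch itself: build_copy_sql and its signature are
hypothetical, while deparseRelation() from postgres_fdw's deparse.c and
quote_identifier() from the server's ruleutils.c are existing
functions), deparsing the COPY statement could look like this:

#include "postgres.h"
#include "lib/stringinfo.h"
#include "nodes/pg_list.h"
#include "utils/rel.h"
#include "utils/ruleutils.h"	/* quote_identifier() */

/*
 * Hypothetical helper: build "COPY tab (col, ...) FROM STDIN" for the
 * attributes being inserted. The v1 patch also appends
 * "(FORMAT TEXT, DELIMITER ',')"; this sketch uses the default text
 * format instead.
 */
static char *
build_copy_sql(Relation rel, List *target_attrs, TupleDesc tupdesc)
{
	StringInfoData buf;
	bool		first = true;

	initStringInfo(&buf);
	appendStringInfoString(&buf, "COPY ");
	deparseRelation(&buf, rel);		/* from postgres_fdw's deparse.c */
	appendStringInfoString(&buf, " (");

	foreach_int(attnum, target_attrs)
	{
		Form_pg_attribute attr = TupleDescAttr(tupdesc, attnum - 1);

		/* Generated columns are not transmitted */
		if (attr->attgenerated)
			continue;

		if (!first)
			appendStringInfoString(&buf, ", ");
		first = false;
		appendStringInfoString(&buf, quote_identifier(NameStr(attr->attname)));
	}
	appendStringInfoString(&buf, ") FROM STDIN");

	return buf.data;
}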
Attachments
Hi Matheus,

Thanks for the patch. Please add it to the next commitfest (PG19-3) at

https://commitfest.postgresql.org/

so that we don't lose track of the patch.

On 10/15/25 17:02, Matheus Alcantara wrote:
> Hi all,
>
> Currently, postgres_fdw uses prepared statements to insert batches
> into foreign tables. Although this works fine for most use cases, the
> COPY command can be used in some scenarios to speed up large batch
> inserts.
>

Makes sense.

> The attached patch implements this idea of using the COPY command for
> batch inserts on postgres_fdw foreign tables. I've performed some
> benchmarks using pgbench and the results seem good enough to consider
> this.
>

Thanks. The code looks sensible in general, I think. I have a couple of
minor comments below. It'd be good to also update the documentation,
and add some tests to postgres_fdw.sql, to exercise this new code.

The sgml docs (doc/src/sgml/postgres-fdw.sgml) mention batch_size, and
explain how it's related to the number of parameters in the INSERT
statement. That does not matter when using COPY under the hood, so this
should be amended/clarified in some way. It doesn't need to be
super-detailed, though.

A couple of minor comments about the code:

1) buildCopySql

- Does this really need the "(FORMAT TEXT, DELIMITER ',')" part? AFAIK
no, if we use the default copy format in convert_slot_to_copy_text.

- Shouldn't we cache the COPY SQL, similarly to how we keep the INSERT
SQL, instead of rebuilding it over and over for every batch?

2) convert_slot_to_copy_text

- It's probably better to not hard-code the delimiters etc.

- I wonder if the formatting needs to do something more like what
copyto.c does (through CopyToTextOneRow and CopyAttributeOutText).
Maybe not, not sure. (A sketch of those escaping rules follows at the
end of this message.)

3) execute_foreign_modify

- I think the new block of code is a bit confusing. It essentially does
something similar to the original code, but not entirely. I suggest we
move it to a new function, and call that from execute_foreign_modify.

- I agree it's probably better to do COPYBUFSIZ, or something like
that, to send the data in smaller (but not tiny) chunks.

> I've run the benchmark with different batch_size values to see when
> this optimization could be useful. The following results are the best
> tps of 3 runs.
>
> Command: pgbench -n -c 10 -j 10 -t 100 -f bench.sql postgres
>
> batch_size: 10
> master tps: 76.360406
> patch tps: 68.917109
>
> batch_size: 100
> master tps: 123.427731
> patch tps: 243.737055
>
> batch_size: 1000
> master tps: 132.500506
> patch tps: 239.295132
>
> It seems that with a batch_size of 100 or more we can get a
> considerable speedup for batch inserts.
>

I did a bunch of benchmarks too, and I see similar speedups (depending
on the batch and data size). Attached is the script I used to run this.
It runs COPY into a foreign table, pointing to a second instance on the
same machine. And it does that for a range of data sizes, batch sizes,
client counts, and logged/unlogged tables.

The attached PDF summarizes the results, comparing the "copy" build
(with this patch) to "master". There are also two "resourceowner"
builds, with the patch at [1] - that helps COPY in general a lot, and
it improves the case with batch_size=1000.

I think the results are good, at least with larger batch sizes. The
regressions with batch_size=10 on an UNLOGGED table are not great,
though. I have results from another machine, and there it affects even
the LOGGED table. I'm not sure if this is inherent, or something the
patch can fix.
In a way, it's not surprising - batching with tiny batches adds the
overhead without much benefit. I don't think that kills the patch.

> The attached patch uses the COPY command whenever we have *numSlots > 1,
> but the tests show that maybe we should have a GUC to enable this?
>

I can imagine having a GUC for testing, but it's not strictly necessary.

> I also think that we can have a better patch by removing the duplicated
> code introduced in this first version, especially in the cleanup phase,
> but I tried to keep things simple in this initial phase to make the
> review easier and also just to test the idea.
>
> Lastly, I don't know if we should change the EXPLAIN (ANALYZE, VERBOSE)
> output for batch inserts that use COPY to mention that we are sending
> the COPY command to the remote server. I guess so?
>

Good point. We definitely should not show the SQL for an INSERT when
we're actually running a COPY.

regards

[1] https://www.postgresql.org/message-id/84f20db7-7d57-4dc0-8144-7e38e0bbe75d%40vondra.me

--
Tomas Vondra
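For reference, a simplified sketch of the escaping that the default
COPY TEXT format requires (roughly what CopyAttributeOutText in
copyto.c does; the helper name is hypothetical, and the real function
also handles a configurable delimiter and encoding details):

#include "postgres.h"
#include "lib/stringinfo.h"

/*
 * Simplified sketch of default COPY TEXT escaping: backslash and the
 * control characters that are special in the text format get
 * backslash-escaped; NULL values are emitted as \N by the caller.
 */
static void
copy_text_escape(StringInfo buf, const char *val)
{
	for (const char *p = val; *p; p++)
	{
		switch (*p)
		{
			case '\b': appendStringInfoString(buf, "\\b"); break;
			case '\f': appendStringInfoString(buf, "\\f"); break;
			case '\n': appendStringInfoString(buf, "\\n"); break;
			case '\r': appendStringInfoString(buf, "\\r"); break;
			case '\t': appendStringInfoString(buf, "\\t"); break;
			case '\v': appendStringInfoString(buf, "\\v"); break;
			case '\\': appendStringInfoString(buf, "\\\\"); break;
			default:   appendStringInfoChar(buf, *p); break;
		}
	}
}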
Attachments
On Thu, Oct 16, 2025 at 10:42 PM Tomas Vondra <tomas@vondra.me> wrote:
> Thanks for the patch. Please add it to the next commitfest (PG19-3) at
Hi Matheus! Same here - thanks for the patch!
> > The attached patch uses the COPY command whenever we have *numSlots > 1,
> > but the tests show that maybe we should have a GUC to enable this?
> >
>
> I can imagine having a GUC for testing, but it's not strictly necessary.
Just a note: I've played with this patch for maybe 20 minutes and it
works. However, if we added yet another GUC, we would need to enable
two of them (batch_size and this hypothetical `batch_use_copy`?).
Some other stuff I've tried to cover:
1. How this works with INSERT RETURNING -> as per the patch it falls
back from COPY to INSERT as expected.
2. How this works with INSERT ON CONFLICT -> well, we cannot have
constraints on postgres_fdw foreign tables, so it is impossible.
3. How this works with MERGE -> well, MERGE doesn't work with postgres_fdw.
4. I've found that big rows don't play well with the COPY feature
without a memory limit, so some special handling should probably be
done here. It's a nonsense query, but:
postgres@postgres:1236 : 15836 # INSERT INTO local_t1 (id, t)
SELECT s, repeat(md5(s::text), 10000000) from generate_series(100,
103) s;
2025-10-17 11:17:08.742 CEST [15836] LOG: statement: INSERT INTO
local_t1 (id, t) SELECT s, repeat(md5(s::text), 10000000) from
generate_series(100, 103) s;
2025-10-17 11:17:08.743 CEST [15838] LOG: statement: START
TRANSACTION ISOLATION LEVEL REPEATABLE READ
2025-10-17 11:17:38.302 CEST [15838] LOG: statement: COPY
public.t1(id, t, counter) FROM STDIN (FORMAT TEXT, DELIMITER ',')
ERROR: string buffer exceeds maximum allowed length (1073741823 bytes)
DETAIL: Cannot enlarge string buffer containing 960000028 bytes
by 320000000 more bytes.
2025-10-17 11:17:40.213 CEST [15836] ERROR: string buffer exceeds
maximum allowed length (1073741823 bytes)
2025-10-17 11:17:40.213 CEST [15836] DETAIL: Cannot enlarge
string buffer containing 960000028 bytes by 320000000 more bytes.
2025-10-17 11:17:40.213 CEST [15836] STATEMENT: INSERT INTO
local_t1 (id, t) SELECT s, repeat(md5(s::text), 10000000) from
generate_series(100, 103) s;
but then the backend never finishes (constant loop in PQCleanup() or
somewhere close to that), and the server behaves unstably. Without
batch_size set, the very same INSERT behaves OK.
Regards,
-J.
Thanks for the review!

On Thu Oct 16, 2025 at 5:39 PM -03, Tomas Vondra wrote:
> Thanks for the patch. Please add it to the next commitfest (PG19-3) at
>
> https://commitfest.postgresql.org/
>
> so that we don't lose track of the patch.
>

Here it is: https://commitfest.postgresql.org/patch/6137/

> Thanks. The code looks sensible in general, I think. I have a couple of
> minor comments below. It'd be good to also update the documentation,
> and add some tests to postgres_fdw.sql, to exercise this new code.
>
> The sgml docs (doc/src/sgml/postgres-fdw.sgml) mention batch_size, and
> explain how it's related to the number of parameters in the INSERT
> statement. That does not matter when using COPY under the hood, so this
> should be amended/clarified in some way. It doesn't need to be
> super-detailed, though.
>

I'll work on this and intend to have something in the next version.

> A couple of minor comments about the code:
>
> 1) buildCopySql
>
> - Does this really need the "(FORMAT TEXT, DELIMITER ',')" part? AFAIK
> no, if we use the default copy format in convert_slot_to_copy_text.
>

You're right. I've removed the (FORMAT TEXT, DELIMITER ',') part.

> - Shouldn't we cache the COPY SQL, similarly to how we keep the INSERT
> SQL, instead of rebuilding it over and over for every batch?
>

I tried to reuse the fmstate->query field to cache the COPY SQL, but
running the postgres_fdw.sql regression tests shows that this may not
work. When we are running a user-supplied COPY command on a foreign
table, CopyMultiInsertBufferFlush() calls
ri_FdwRoutine->ExecForeignBatchInsert, which may pass different values
for numSlots based on the number of slots already sent to the foreign
server; eventually it may pass numSlots as 1, which will not use COPY
under the hood to send the data. If we cache the COPY command in
fmstate->query this breaks, because the normal INSERT path in
execute_foreign_modify uses fmstate->query to build the prepared
statement sent to the foreign server. So basically what I'm trying to
say is that when the server is executing a COPY into a foreign table,
it may use either the COPY command or the INSERT command to send the
data to the foreign server. That being said, I decided to create a new
copy_query field on PgFdwModifyState to cache only COPY commands.
Please let me know if my understanding is wrong or if we could have a
better approach here.

> 2) convert_slot_to_copy_text
>
> - It's probably better to not hard-code the delimiters etc.
>

Now that we are using the default copy format, is it safe to hard-code
the default delimiter, which is \t? Or should we still make this
parameterizable, or something else?

> - I wonder if the formatting needs to do something more like what
> copyto.c does (through CopyToTextOneRow and CopyAttributeOutText).
> Maybe not, not sure.
>

I'm still checking this. I think it is not needed, but I want to do
some more tests to make sure.

> 3) execute_foreign_modify
>
> - I think the new block of code is a bit confusing. It essentially does
> something similar to the original code, but not entirely. I suggest we
> move it to a new function, and call that from execute_foreign_modify.
>

Fixed.

> - I agree it's probably better to do COPYBUFSIZ, or something like
> that, to send the data in smaller (but not tiny) chunks.
>

Fixed. I've declared the default chunk size as 8192 (the same value
used in psql/copy.c); let me know if we should use another value or
perhaps make this configurable.
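For reference, the sending side could look roughly like this (a sketch,
not the patch itself: the helper names are hypothetical,
PQputCopyData/PQputCopyEnd are standard libpq calls, and error handling
is simplified):

#include "postgres.h"
#include "libpq-fe.h"

#define COPYBUFSIZ	8192		/* same value psql/copy.c uses */

/* Hypothetical helper: push accumulated COPY text in COPYBUFSIZ chunks. */
static void
send_copy_data(PGconn *conn, const char *data, size_t len)
{
	size_t		offset = 0;

	while (offset < len)
	{
		int			chunk = (int) Min((size_t) COPYBUFSIZ, len - offset);

		if (PQputCopyData(conn, data + offset, chunk) != 1)
			elog(ERROR, "could not send COPY data: %s", PQerrorMessage(conn));
		offset += chunk;
	}
}

/* Hypothetical helper: finish the COPY and consume its result. */
static void
finish_copy(PGconn *conn)
{
	PGresult   *res;

	if (PQputCopyEnd(conn, NULL) != 1)
		elog(ERROR, "could not end COPY: %s", PQerrorMessage(conn));

	while ((res = PQgetResult(conn)) != NULL)
	{
		/* simplified: an ERROR here leaks res */
		if (PQresultStatus(res) != PGRES_COMMAND_OK)
			elog(ERROR, "COPY failed: %s", PQerrorMessage(conn));
		PQclear(res);
	}
}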
>> Lastly, I don't know if we should change the EXPLAIN (ANALYZE, VERBOSE)
>> output for batch inserts that use COPY to mention that we are sending
>> the COPY command to the remote server. I guess so?
>>
>
> Good point. We definitely should not show the SQL for an INSERT when
> we're actually running a COPY.
>

This seems a bit tricky to implement. COPY is used based on the number
of slots in the TupleTableSlot array used for the batch insert. The
numSlots that execute_foreign_modify() receives comes from
ResultRelInfo->ri_NumSlots during ExecInsert(). We don't have this
information during EXPLAIN, which is handled by
postgresExplainForeignModify(); we only have
ResultRelInfo->ri_BatchSize at this stage. The current idea is to use
the COPY command if the number of slots is > 1, so I'm wondering if we
should use another mechanism to enable the COPY usage. For example, we
could use COPY whenever batch_size is configured to a number greater
than X - but what if the INSERT statement is only inserting a single
row? Should we still use the COPY command to ingest a single row into
the foreign table? Any thoughts?

--
Matheus Alcantara
Attachments
Thanks for testing and for the comments!

On Fri Oct 17, 2025 at 6:28 AM -03, Jakub Wartak wrote:
> On Thu, Oct 16, 2025 at 10:42 PM Tomas Vondra <tomas@vondra.me> wrote:
>> Thanks for the patch. Please add it to the next commitfest (PG19-3) at
>
> Hi Matheus! Same here - thanks for the patch!
>
>> > The attached patch uses the COPY command whenever we have *numSlots > 1,
>> > but the tests show that maybe we should have a GUC to enable this?
>> >
>>
>> I can imagine having a GUC for testing, but it's not strictly necessary.
>
> Just a note: I've played with this patch for maybe 20 minutes and it
> works. However, if we added yet another GUC, we would need to enable
> two of them (batch_size and this hypothetical `batch_use_copy`?).
>

I was thinking of a GUC that enables the use of the COPY command if the
number of rows being ingested during the batch insert is greater than
the value configured in that GUC, something like batch_size_for_copy.
But for now I'm starting to think that perhaps we could use COPY if
batch_size is configured to a number > 1 (or make this configurable?).
That would make it easier to show in EXPLAIN that we will send a COPY
command to the remote server instead of an INSERT. The current patch
relies on the number of rows being sent to the foreign server to decide
whether to use COPY, and IIUC we don't have this information during
EXPLAIN. I wrote more about this in my previous reply [1].

> Some other stuff I've tried to cover:
> ...
> 4. I've found that big rows don't play well with the COPY feature
> without a memory limit, so some special handling should probably be
> done here. It's a nonsense query, but:
>
> postgres@postgres:1236 : 15836 # INSERT INTO local_t1 (id, t)
> SELECT s, repeat(md5(s::text), 10000000) from generate_series(100,
> 103) s;
> 2025-10-17 11:17:08.742 CEST [15836] LOG: statement: INSERT INTO
> local_t1 (id, t) SELECT s, repeat(md5(s::text), 10000000) from
> generate_series(100, 103) s;
> 2025-10-17 11:17:08.743 CEST [15838] LOG: statement: START
> TRANSACTION ISOLATION LEVEL REPEATABLE READ
> 2025-10-17 11:17:38.302 CEST [15838] LOG: statement: COPY
> public.t1(id, t, counter) FROM STDIN (FORMAT TEXT, DELIMITER ',')
> ERROR: string buffer exceeds maximum allowed length (1073741823 bytes)
> DETAIL: Cannot enlarge string buffer containing 960000028 bytes
> by 320000000 more bytes.
> 2025-10-17 11:17:40.213 CEST [15836] ERROR: string buffer exceeds
> maximum allowed length (1073741823 bytes)
> 2025-10-17 11:17:40.213 CEST [15836] DETAIL: Cannot enlarge
> string buffer containing 960000028 bytes by 320000000 more bytes.
> 2025-10-17 11:17:40.213 CEST [15836] STATEMENT: INSERT INTO
> local_t1 (id, t) SELECT s, repeat(md5(s::text), 10000000) from
> generate_series(100, 103) s;
>
> but then the backend never finishes (constant loop in PQCleanup() or
> somewhere close to that), and the server behaves unstably. Without
> batch_size set, the very same INSERT behaves OK.
>

In the latest version that I sent in [1] I introduced a buffer size
limit, and testing this INSERT statement with that version seems to fix
the issue. Could you please check this too?

[1] https://www.postgresql.org/message-id/CAFY6G8ePwjT8GiJX1AK5FDMhfq-sOnny6optgTPg98HQw7oJ0g%40mail.gmail.com

--
Matheus Alcantara
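The buffer limit mentioned above could look roughly like this (a sketch
under assumptions: copy_data is the StringInfo accumulating row text as
discussed upthread, while COPY_BUFFER_LIMIT, append_copy_row and the
send_copy_data helper sketched earlier are hypothetical names):

#define COPY_BUFFER_LIMIT	(8 * 1024 * 1024)	/* assumed 8MB cap */

/*
 * Sketch: append one row's COPY text and flush to the server once the
 * buffer grows past the limit, so a batch of very wide rows cannot hit
 * the 1GB StringInfo ceiling.
 */
static void
append_copy_row(PgFdwModifyState *fmstate, PGconn *conn, const char *rowtext)
{
	appendStringInfoString(&fmstate->copy_data, rowtext);

	if (fmstate->copy_data.len >= COPY_BUFFER_LIMIT)
	{
		send_copy_data(conn, fmstate->copy_data.data, fmstate->copy_data.len);
		resetStringInfo(&fmstate->copy_data);
	}
}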
On Tue Oct 21, 2025 at 11:25 AM -03, Matheus Alcantara wrote:
>>> Lastly, I don't know if we should change the EXPLAIN (ANALYZE, VERBOSE)
>>> output for batch inserts that use COPY to mention that we are sending
>>> the COPY command to the remote server. I guess so?
>>>
>>
>> Good point. We definitely should not show the SQL for an INSERT when
>> we're actually running a COPY.
>>
> This seems a bit tricky to implement. COPY is used based on the number
> of slots in the TupleTableSlot array used for the batch insert. The
> numSlots that execute_foreign_modify() receives comes from
> ResultRelInfo->ri_NumSlots during ExecInsert(). We don't have this
> information during EXPLAIN, which is handled by
> postgresExplainForeignModify(); we only have
> ResultRelInfo->ri_BatchSize at this stage. The current idea is to use
> the COPY command if the number of slots is > 1, so I'm wondering if we
> should use another mechanism to enable the COPY usage. For example, we
> could use COPY whenever batch_size is configured to a number greater
> than X - but what if the INSERT statement is only inserting a single
> row? Should we still use the COPY command to ingest a single row into
> the foreign table? Any thoughts?
>

Thinking more about this, I realized that when we are deparsing the
remote SQL to be sent to the foreign server at the planning phase (via
postgresPlanForeignModify()) we don't have the batch_size and
number-of-rows information, so currently we cannot know at plan time
whether COPY will be used for a batch insert, because IIUC this
information is only available at query runtime. One way to make it
possible is that we could simply use PgFdwModifyState->copy_data during
postgresExplainForeignModify() if it's not null. Since we will only
have this information during query execution, the drawback of this
approach is that we would only show the COPY as the Remote SQL during
EXPLAIN (ANALYZE). Please see the attached v3 version that implements
this idea.

> I tried to reuse the fmstate->query field to cache the COPY SQL, but
> running the postgres_fdw.sql regression tests shows that this may not
> work. When we are running a user-supplied COPY command on a foreign
> table, CopyMultiInsertBufferFlush() calls
> ri_FdwRoutine->ExecForeignBatchInsert, which may pass different values
> for numSlots based on the number of slots already sent to the foreign
> server; eventually it may pass numSlots as 1, which will not use COPY
> under the hood to send the data. If we cache the COPY command in
> fmstate->query this breaks, because the normal INSERT path in
> execute_foreign_modify uses fmstate->query to build the prepared
> statement sent to the foreign server. So basically what I'm trying to
> say is that when the server is executing a COPY into a foreign table,
> it may use either the COPY command or the INSERT command to send the
> data to the foreign server. That being said, I decided to create a new
> copy_query field on PgFdwModifyState to cache only COPY commands.
> Please let me know if my understanding is wrong or if we could have a
> better approach here.
>

Based on the information that I mentioned above, I think that we need
some way to not mix INSERT and COPY commands when executing a
user-supplied COPY on a foreign table. Alternatively, we could disable
the COPY under the hood and always fall back to INSERT, or enable COPY
even when *numSlots is 1, so that in the EXPLAIN (ANALYZE) output we
can show the Remote SQL correctly. Does that make sense?
I'm still not sure whether the trigger for using the COPY command for
batch inserts should be *numSlots > 1 or something else. I'm open to
better ideas. Thoughts?

--
Matheus Alcantara
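The EXPLAIN (ANALYZE) idea could look roughly like this inside
postgresExplainForeignModify() (a sketch: copy_query is the field
proposed upthread for caching the COPY statement, while
FdwModifyPrivateUpdateSql and ExplainPropertyText are how postgres_fdw
already reports the deparsed INSERT):

/* Inside postgresExplainForeignModify() - a sketch, not the patch. */
if (es->verbose)
{
	PgFdwModifyState *fmstate = (PgFdwModifyState *) rinfo->ri_FdwState;

	/* Only set during EXPLAIN (ANALYZE), when the COPY path actually ran */
	if (fmstate && fmstate->copy_query)
		ExplainPropertyText("Remote SQL", fmstate->copy_query, es);
	else
	{
		char	   *sql = strVal(list_nth(fdw_private,
										  FdwModifyPrivateUpdateSql));

		ExplainPropertyText("Remote SQL", sql, es);
	}
}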
Attachments
On Thu, Oct 23, 2025 at 8:01 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:
>
> Please see the attached v3 version that implements this idea.
>

Hi. I am not familiar with this module.

Some of the foreach calls can be replaced with foreach_int.

I suspect that somewhere Form_pg_attribute.attisdropped is not handled
properly. The following setup will crash:

---source database
drop table batch_table1;
create table batch_table1(x int);

---foreign table database
drop foreign table if exists ftable1;
CREATE FOREIGN TABLE ftable1 ( x int ) SERVER loopback1 OPTIONS (
table_name 'batch_table1', batch_size '10' );
ALTER FOREIGN TABLE ftable1 DROP COLUMN x;
ALTER FOREIGN TABLE ftable1 add COLUMN x int;

INSERT INTO ftable1 SELECT * FROM generate_series(1, 10) i; --- this
will cause server crash.
Hi, thanks for testing this patch!
On Thu Oct 23, 2025 at 6:49 AM -03, jian he wrote:
> On Thu, Oct 23, 2025 at 8:01 AM Matheus Alcantara
> <matheusssilv97@gmail.com> wrote:
>>
>> Please see the attached v3 version that implements this idea.
>>
> Hi. I am not familiar with this module.
>
> Some of the foreach calls can be replaced with foreach_int.
>
Fixed.
> I suspect that somewhere Form_pg_attribute.attisdropped is not handled properly.
> The following setup will crash:
>
> ---source database
> drop table batch_table1;
> create table batch_table1(x int);
>
> ---foreign table database
> drop foreign table if exists ftable1;
> CREATE FOREIGN TABLE ftable1 ( x int ) SERVER loopback1 OPTIONS (
> table_name 'batch_table1', batch_size '10' );
> ALTER FOREIGN TABLE ftable1 DROP COLUMN x;
> ALTER FOREIGN TABLE ftable1 add COLUMN x int;
>
> INSERT INTO ftable1 SELECT * FROM generate_series(1, 10) i; --- this
> will cause server crash.
>
I've tested this scenario, and the attisdropped field is still being
set to false. After some debugging I realized that the problem was how
I was accessing the fmstate->p_flinfo array - I was using attnum - 1 as
the index, but I don't think that's correct.
In create_foreign_modify() we have the following code:
fmstate->p_flinfo = (FmgrInfo *) palloc0(sizeof(FmgrInfo) * n_params);
fmstate->p_nums = 0;

if (operation == CMD_INSERT || operation == CMD_UPDATE)
{
    /* Set up for remaining transmittable parameters */
    foreach(lc, fmstate->target_attrs)
    {
        int         attnum = lfirst_int(lc);
        Form_pg_attribute attr = TupleDescAttr(tupdesc, attnum - 1);

        Assert(!attr->attisdropped);

        /* Ignore generated columns; they are set to DEFAULT */
        if (attr->attgenerated)
            continue;
        getTypeOutputInfo(attr->atttypid, &typefnoid, &isvarlena);
        fmgr_info(typefnoid, &fmstate->p_flinfo[fmstate->p_nums]);
        fmstate->p_nums++;
    }
}
So I think that I should index the fmstate->p_flinfo array, when
looping through target_attrs, with an int counter starting at 0 and
incremented after each iteration - the same way the array is filled
above. Although I'm not sure my understanding is fully correct, I've
implemented this in the attached patch and it seems to fix the error.
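For illustration, the corrected lookup in the conversion routine could
look roughly like this (a sketch: buf, slot, tupdesc and fmstate are
assumed from the surrounding function, and escaping of special
characters is omitted; slot_getattr() and OutputFunctionCall() are
existing server functions):

/*
 * Sketch of the fix described above: p_flinfo is filled sequentially
 * in create_foreign_modify(), so index it with a counter that advances
 * the same way, not with attnum - 1 (which goes wrong once a dropped
 * column shifts the attribute numbers).
 */
int			pindex = 0;
bool		first = true;

foreach_int(attnum, fmstate->target_attrs)
{
	Form_pg_attribute attr = TupleDescAttr(tupdesc, attnum - 1);
	bool		isnull;
	Datum		value;

	/* Generated columns get no p_flinfo entry in create_foreign_modify() */
	if (attr->attgenerated)
		continue;

	if (!first)
		appendStringInfoChar(&buf, '\t');
	first = false;

	value = slot_getattr(slot, attnum, &isnull);
	if (isnull)
		appendStringInfoString(&buf, "\\N");
	else
		appendStringInfoString(&buf,
							   OutputFunctionCall(&fmstate->p_flinfo[pindex],
												  value));
	pindex++;
}
appendStringInfoChar(&buf, '\n');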
In this new version I also added some regression tests to
postgres_fdw.sql.
--
Matheus Alcantara