Thread: Re: Bypassing cursors in postgres_fdw to enable parallel plans


Re: Bypassing cursors in postgres_fdw to enable parallel plans

From
Andy Fan
Date:
Rafia Sabih <rafia.pghackers@gmail.com> writes:


Hi,

>
> At present in postgres_fdw, a query that would use a parallel plan when fired from the remote end fails to use the
> parallel plan locally because of the presence of CURSORS. Consider the following example,
...
>
> Now, to overcome this limitation, I have worked on this idea (suggested by my colleague Bernd Helmle) of bypassing the
> cursors.

Do you know why we can't use a parallel plan when a cursor is used? Is it
related to this code in ExecutePlan?


    /*
     * Set up parallel mode if appropriate.
     *
     * Parallel mode only supports complete execution of a plan.  If we've
     * already partially executed it, or if the caller asks us to exit early,
     * we must force the plan to run without parallelism.
     */
    if (queryDesc->already_executed || numberTuples != 0)
             use_parallel_mode = false;

Actually I can't understand that comment either, and I have had this
confusion for a long time.

-- 
Best Regards
Andy Fan




Re: Bypassing cursors in postgres_fdw to enable parallel plans

From
Rafia Sabih
Date:
Hello hackers,

I am back at this work with a rebased and revised patch; the new version also takes a different approach.
Whenever we are in non-cursor mode, we always save the tuples of the first query
in a tuplestore, because we have no means of knowing beforehand how many cursors are required for the query.
And once we switch to the next query, we have no way to fetch the tuples of the previous query,
so the tuples retrieved earlier for the first query would be lost if not saved.
I would highly appreciate your time and feedback on this.

On Wed, 12 Mar 2025 at 12:57, Andy Fan <zhihuifan1213@163.com> wrote:
> Rafia Sabih <rafia.pghackers@gmail.com> writes:
>
> Hi,
>
> >
> > At present in postgres_fdw, a query that would use a parallel plan when fired from the remote end fails to use the
> > parallel plan locally because of the presence of CURSORS. Consider the following example,
> ...
> >
> > Now, to overcome this limitation, I have worked on this idea (suggested by my colleague Bernd Helmle) of bypassing the
> > cursors.
>
> Do you know why we can't use a parallel plan when a cursor is used? Is it
> related to this code in ExecutePlan?
>
>         /*
>          * Set up parallel mode if appropriate.
>          *
>          * Parallel mode only supports complete execution of a plan.  If we've
>          * already partially executed it, or if the caller asks us to exit early,
>          * we must force the plan to run without parallelism.
>          */
>         if (queryDesc->already_executed || numberTuples != 0)
>                 use_parallel_mode = false;
>
> Actually I can't understand that comment either, and I have had this
> confusion for a long time.
>
> --
> Best Regards
> Andy Fan



--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH
Attachments

Re: Bypassing cursors in postgres_fdw to enable parallel plans

From
Robert Haas
Date:
On Wed, Mar 12, 2025 at 7:57 AM Andy Fan <zhihuifan1213@163.com> wrote:
> Do you know why we can't use parallel plan when cursor is used? Is It
> related to this code in ExecutePlan?

Yes. When a cursor is used, the whole query isn't executed all at
once, but rather the executor will be started and stopped for each
fetch from the cursor. We can't keep the parallel workers running for
that whole time, not just because it would be inefficient, but because
it would be incorrect. State changes would be possible in the leader
that were not reflected in the workers, leading to chaos.
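For illustration, the guard quoted earlier from ExecutePlan can be sketched as follows (a minimal sketch with hypothetical names, not the actual executor code): a bounded fetch, which is exactly what a cursor FETCH is, disables parallel mode.

```python
def use_parallel_mode(plan_supports_parallel: bool,
                      already_executed: bool,
                      number_tuples: int) -> bool:
    """Sketch of the ExecutePlan guard: parallel mode only supports
    complete, one-shot execution of a plan.

    number_tuples != 0 means the caller asked for a bounded fetch
    (e.g. FETCH 100 FROM c1), so the executor may be stopped and
    restarted later; parallel workers cannot survive that stop,
    because leader state could change while they are suspended.
    """
    if already_executed or number_tuples != 0:
        return False
    return plan_supports_parallel

# A one-shot query fetches everything at once (number_tuples == 0),
# so it may run in parallel; any cursor-style bounded fetch may not.
```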

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Bypassing cursors in postgres_fdw to enable parallel plans

From
Robert Haas
Date:
On Mon, Sep 29, 2025 at 10:51 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> I am back at this work with a rebased and revised patch. The new version is rebased and has a change in approach.
> Whenever we are using non-cursor mode, for the first cursor we are always saving the tuples
> in the tuplestore, this is because we do not have any means to know beforehand how many cursors are required for the
> query.

This might have the advantage of being simpler, but it's definitely
worse. If we're only fetching one result set, which will be common,
we'll buffer the whole thing in a tuplestore where that could
otherwise be avoided. Maybe it's still best to go with this approach;
not sure.

> And when we switch to the next query then we do not have a way to fetch the tuples for the previous query.
> So, the tuples retrieved earlier for the first query were lost if not saved.
> I would highly appreciate your time and feedback for this.

My suggestions are to work on the following areas:

1. Automated testing. The patch has no regression tests, and won't get
committed without those.

2. Manual testing. How does the performance with this new option
compare to the existing method? The answer might be different for
small result sets that fit in memory and large ones that spill to
disk, and will certainly also depend on how productive parallel query
can be on the remote side; but I think you want to do and post on this
list some measurements showing the best and worst case for the patch.

3. Patch clean-up. There are plenty of typos and whitespace-only
changes in the patch. It's best to clean those up. Running pgindent is
a good idea too. Some places could use comments.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Bypassing cursors in postgres_fdw to enable parallel plans

From
Rafia Sabih
Date:


On Fri, 10 Oct 2025 at 22:02, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Sep 29, 2025 at 10:51 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> > I am back at this work with a rebased and revised patch. The new version is rebased and has a change in approach.
> > Whenever we are using non-cursor mode, for the first cursor we are always saving the tuples
> > in the tuplestore, this is because we do not have any means to know beforehand how many cursors are required for the query.
>
> This might have the advantage of being simpler, but it's definitely
> worse. If we're only fetching one result set, which will be common,
> we'll buffer the whole thing in a tuplestore where that could
> otherwise be avoided. Maybe it's still best to go with this approach;
> not sure.
>
> > And when we switch to the next query then we do not have a way to fetch the tuples for the previous query.
> > So, the tuples retrieved earlier for the first query were lost if not saved.
> > I would highly appreciate your time and feedback for this.
>
> My suggestions are to work on the following areas:
>
> 1. Automated testing. The patch has no regression tests, and won't get
> committed without those.
>
> 2. Manual testing. How does the performance with this new option
> compare to the existing method? The answer might be different for
> small result sets that fit in memory and large ones that spill to
> disk, and will certainly also depend on how productive parallel query
> can be on the remote side; but I think you want to do and post on this
> list some measurements showing the best and worst case for the patch.
>
> 3. Patch clean-up. There are plenty of typos and whitespace-only
> changes in the patch. It's best to clean those up. Running pgindent is
> a good idea too. Some places could use comments.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com

Thank you Robert, for reviewing this patch. Going through the patch more, I realised it was not equipped to handle cases with more than two active cursors. To accommodate such cases, I have now changed the new struct that saves the previous query into a list of such structs. Also, it turns out we need not save the tuples while a cursor is still active, so we populate the associated tuplestore only when we need to create a new cursor while the old cursor is not completely done.
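To illustrate that bookkeeping, here is a rough sketch in Python (hypothetical names; the actual patch does this at the postgres_fdw connection level with a real tuplestore): a result set is drained into a spool only when another query needs the connection while the first is not completely read.

```python
from typing import Iterator, List, Optional

class ResultBuffer:
    """Toy model of per-query state in non-cursor mode: tuples of a
    suspended result set are spooled so a later query can run on the
    same connection."""

    def __init__(self, source: Iterator[tuple]):
        self.source = source          # live result stream from the remote
        self.spool: List[tuple] = []  # stands in for the tuplestore
        self.suspended = False

    def fetch(self) -> Optional[tuple]:
        if self.spool:                # replay spooled tuples first
            return self.spool.pop(0)
        if self.suspended:
            return None               # spool fully drained
        return next(self.source, None)

    def suspend(self) -> None:
        """Called when another query must run on the same connection:
        drain the remaining tuples into the spool (tuplestore)."""
        for row in self.source:
            self.spool.append(row)
        self.suspended = True
```

The point of this design is that the spool is paid for only when a second active scan actually appears; a single result set is streamed without buffering.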

In this version I have added regression tests as well. I wanted to exercise all the postgres_fdw test cases with this patch; the only way I could figure out to do that was to run the select statements with the new GUC.

I also ran some performance tests. I used the contrib_regression database and populated the table "S 1"."T 1" with more tuples to understand the impact of the patch at a larger scale. I also used auto_explain to capture the foreign plans.

contrib_regression=# select count(*) from ft1;
2025-11-14 14:40:35.825 CET [44338] LOG:  duration: 61.336 ms  plan:
Query Text: select count(*) from ft1;
Foreign Scan  (cost=102.50..123.72 rows=1 width=8)
  Relations: Aggregate on (ft1)
2025-11-14 14:40:35.825 CET [44862] LOG:  duration: 60.575 ms  plan:
Query Text: DECLARE c1 CURSOR FOR
SELECT count(*) FROM "S 1"."T 1"
Aggregate  (cost=21888.22..21888.23 rows=1 width=8)
  ->  Seq Scan on "T 1"  (cost=0.00..19956.98 rows=772498 width=0)
 count  
--------
 990821
(1 row)

Time: 62.728 ms
contrib_regression=# SET postgres_fdw.use_cursor = false; 
SET
Time: 1.515 ms
contrib_regression=# select count(*) from ft1;
2025-11-14 14:40:46.260 CET [44862] LOG:  duration: 21.875 ms  plan:
Query Text: SELECT count(*) FROM "S 1"."T 1"
Finalize Aggregate  (cost=17255.64..17255.65 rows=1 width=8)
  ->  Gather  (cost=17255.43..17255.64 rows=2 width=8)
        Workers Planned: 2
        ->  Partial Aggregate  (cost=16255.43..16255.44 rows=1 width=8)
              ->  Parallel Seq Scan on "T 1"  (cost=0.00..15450.74 rows=321874 width=0)
2025-11-14 14:40:46.260 CET [44338] LOG:  duration: 22.623 ms  plan:
Query Text: select count(*) from ft1;
Foreign Scan  (cost=102.50..123.72 rows=1 width=8)
  Relations: Aggregate on (ft1)
 count  
--------
 990821
(1 row)

Time: 24.862 ms

So for this query, the advantage comes from parallel query, which was otherwise not possible in this scenario.
To study the performance with this patch further, I found another interesting query; here are the results:

contrib_regression=# SET postgres_fdw.use_cursor = true;
SET
contrib_regression=# explain  (analyse, buffers)  SELECT t1."C 1" FROM "S 1"."T 1" t1 left join ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1."C 1");
                                                              QUERY PLAN                                                              
---------------------------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=109013.50..131877.35 rows=772498 width=4) (actual time=112311.578..112804.516 rows=990821.00 loops=1)
   Hash Cond: (t1."C 1" = t3.c1)
   Buffers: shared hit=12232, temp read=12754 written=12754
   ->  Seq Scan on "T 1" t1  (cost=0.00..19956.98 rows=772498 width=4) (actual time=0.039..48.808 rows=990821.00 loops=1)
         Buffers: shared hit=12232
   ->  Hash  (cost=109001.00..109001.00 rows=1000 width=4) (actual time=112310.386..112310.387 rows=990821.00 loops=1)
         Buckets: 262144 (originally 1024)  Batches: 8 (originally 1)  Memory Usage: 6408kB
         Buffers: temp written=2537
         ->  Nested Loop  (cost=200.43..109001.00 rows=1000 width=4) (actual time=0.728..112030.241 rows=990821.00 loops=1)
               ->  Foreign Scan on ft1 t2  (cost=100.00..331.00 rows=1000 width=4) (actual time=0.398..710.505 rows=990821.00 loops=1)
               ->  Foreign Scan on ft2 t3  (cost=100.43..108.66 rows=1 width=4) (actual time=0.082..0.082 rows=1.00 loops=990821)
 Planning:
   Buffers: shared hit=5
 Planning Time: 2.211 ms
 Execution Time: 112825.428 ms
(15 rows)

contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
explain  (analyse, buffers)  SELECT t1."C 1" FROM "S 1"."T 1" t1 left join ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1."C 1");
QUERY PLAN                              
----------------------------------------------------------------------------------------------------------------------------------
 Hash Left Join (cost=109013.50..131877.35 rows=772498 width=4) (actual time=261.416..354.520 rows=990821.00 loops=1)
  Hash Cond: (t1."C 1" = t3.c1)
  Buffers: shared hit=12232, temp written=2660
  -> Seq Scan on "T 1" t1 (cost=0.00..19956.98 rows=772498 width=4) (actual time=0.021..35.531 rows=990821.00 loops=1)
     Buffers: shared hit=12232
  -> Hash (cost=109001.00..109001.00 rows=1000 width=4) (actual time=261.381..261.383 rows=100.00 loops=1)
     Buckets: 1024 Batches: 1 Memory Usage: 12kB
     Buffers: temp written=2660
     -> Nested Loop (cost=200.43..109001.00 rows=1000 width=4) (actual time=255.563..261.356 rows=100.00 loops=1)
        Buffers: temp written=2660
        -> Foreign Scan on ft1 t2 (cost=100.00..331.00 rows=1000 width=4) (actual time=0.433..0.443 rows=100.00 loops=1)
        -> Foreign Scan on ft2 t3 (cost=100.43..108.66 rows=1 width=4) (actual time=2.609..2.609 rows=1.00 loops=100)
           Buffers: temp written=2660
 Planning:
  Buffers: shared hit=5
 Planning Time: 2.284 ms
 Execution Time: 389.358 ms
(17 rows)

So even in this case without a parallel plan, it performs significantly better. I investigated a bit more to find out why the query was so slow with cursors,
and came to understand that it repeatedly abandons and recreates the cursor via the postgresReScanForeignScan code path.
So it looks like the cursor management itself accounts for the extra time.
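As a toy model of that effect (hypothetical names, nothing from the actual patch): a nested loop whose inner side is a cursor-based foreign scan re-declares the cursor on every rescan, so the per-cursor setup cost is multiplied by the outer row count.

```python
class ForeignScanState:
    """Toy model: rescanning a cursor-based foreign scan abandons the
    old cursor and declares a new one, paying setup cost each time."""

    def __init__(self) -> None:
        self.cursors_created = 0

    def begin_scan(self) -> None:
        self.cursors_created += 1   # DECLARE c<N> CURSOR FOR ...

    def rescan(self) -> None:
        # Model of the rescan path described above: the old cursor is
        # abandoned and a brand-new one is declared.
        self.begin_scan()

inner = ForeignScanState()
inner.begin_scan()                  # initial scan
outer_rows = 990_821                # outer side of the nested loop
for _ in range(outer_rows):
    inner.rescan()                  # one new cursor per outer row
```

With roughly a million outer rows, the cursor setup round trips alone dominate the runtime, which matches the slow plan above.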

To understand the impact of the patch when tuples spill to disk, I tried the following example:
contrib_regression=# SET postgres_fdw.use_cursor = true;
SET
contrib_regression=# set enable_hashjoin = off;
SET
contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1 b where a.c1 = b.c1 order by a.c2;
                                                               QUERY PLAN                                                              
----------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=831.49..833.99 rows=1000 width=94) (actual time=4537.437..4598.483 rows=990821.00 loops=1)
   Sort Key: a.c2
   Sort Method: external merge  Disk: 137768kB
   Buffers: temp read=83156 written=83253
   ->  Merge Join  (cost=761.66..781.66 rows=1000 width=94) (actual time=3748.488..4090.547 rows=990821.00 loops=1)
         Merge Cond: (a.c1 = b.c1)
         Buffers: temp read=48725 written=48779
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1818.521..1865.792 rows=990821.00 loops=1)
               Sort Key: a.c1
               Sort Method: external merge  Disk: 75664kB
               Buffers: temp read=18910 written=18937
               ->  Foreign Scan on ft1 a  (cost=100.00..331.00 rows=1000 width=47) (actual time=1.302..1568.640 rows=990821.00 loops=1)
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1929.955..1981.104 rows=990821.00 loops=1)
               Sort Key: b.c1
               Sort Method: external sort  Disk: 79520kB
               Buffers: temp read=29815 written=29842
               ->  Foreign Scan on ft1 b  (cost=100.00..331.00 rows=1000 width=47) (actual time=0.528..1553.249 rows=990821.00 loops=1)
 Planning Time: 0.479 ms
 Execution Time: 4661.872 ms
(19 rows)

contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1 b where a.c1 = b.c1 order by a.c2;
                                                              QUERY PLAN                                                              
---------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=831.49..833.99 rows=1000 width=94) (actual time=3376.385..3435.406 rows=990821.00 loops=1)
   Sort Key: a.c2
   Sort Method: external merge  Disk: 137768kB
   Buffers: temp read=83156 written=83253
   ->  Merge Join  (cost=761.66..781.66 rows=1000 width=94) (actual time=2565.517..2916.814 rows=990821.00 loops=1)
         Merge Cond: (a.c1 = b.c1)
         Buffers: temp read=48725 written=48779
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1249.517..1300.132 rows=990821.00 loops=1)
               Sort Key: a.c1
               Sort Method: external merge  Disk: 75664kB
               Buffers: temp read=18910 written=18937
               ->  Foreign Scan on ft1 a  (cost=100.00..331.00 rows=1000 width=47) (actual time=1.651..980.740 rows=990821.00 loops=1)
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1315.990..1369.576 rows=990821.00 loops=1)
               Sort Key: b.c1
               Sort Method: external sort  Disk: 79520kB
               Buffers: temp read=29815 written=29842
               ->  Foreign Scan on ft1 b  (cost=100.00..331.00 rows=1000 width=47) (actual time=0.426..970.728 rows=990821.00 loops=1)
 Planning Time: 0.491 ms
 Execution Time: 3527.457 ms
(19 rows)

So it does not hurt in this case either. But that is because it is not the tuplestore added by this patch that spills to disk here.
I am working on finding a case where there are two or more active cursors and the stored tuples of one cursor
spill to disk. That should give a picture of the worst-case scenario for this patch.

Also, thanks to Kenan, a fellow hacker who finds this work interesting and has offered to do some performance analysis for this patch;
maybe he can post more results here as well.

--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH
Attachments

Re: Bypassing cursors in postgres_fdw to enable parallel plans

From
KENAN YILMAZ
Date:
Hello Hackers,

I have run some quick tests with use_cursor = true/false.

The test table 't1' has 20,000,000 rows (~2.8 GB in size).

The test query is `EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;`

The results, in summary:

** `postgres_fdw.use_cursor = true` --> 20090.782 ms
** `postgres_fdw.use_cursor = false` --> 24103.994 ms

If you wish to run more advanced and complex test queries, feel free to do so.

All test execution scenarios are listed below:

--
-- restart with flush caches;
$ pg_ctl -D $PGDATA stop && sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches && rm $PGDATA/log/*.log && pg_ctl -D $PGDATA start
-- Data prep stage
DROP DATABASE IF EXISTS testdb;
CREATE DATABASE testdb;
\c testdb;
DROP TABLE IF EXISTS t1 CASCADE;
CREATE UNLOGGED TABLE t1 (a int, b int, c text, d timestamp);

-- Insert test data
INSERT INTO t1 SELECT 10 + mod(i, 30), i, md5(i::text) || md5((i+1000000)::text) || md5((i+2000000)::text), '2025-01-01 10:00:00'::timestamp + (random() * 31536000) * INTERVAL '1 second' FROM generate_series(1, 20000000) i;

-- Table maintenance
ALTER TABLE t1 SET LOGGED;
VACUUM (ANALYZE, FULL) t1;

-- FDW prep stage
DROP EXTENSION IF EXISTS postgres_fdw CASCADE;
CREATE EXTENSION postgres_fdw;
CREATE SERVER foreign_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '127.0.0.1', port '5432', dbname 'testdb');
CREATE USER MAPPING FOR postgres SERVER foreign_server OPTIONS (user 'postgres');
CREATE FOREIGN TABLE t1fdw (a int, b int, c text, d timestamp) SERVER foreign_server OPTIONS (table_name 't1');

-- restart with flush caches are applied before all stage executions.

--
-- * Local t1 table EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1 WHERE a > 1000;
                                                         QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Gather  (cost=0.00..467803.85 rows=1 width=105) (actual time=23260.306..23261.591 rows=0.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared read=363637
   I/O Timings: shared read=64590.910
   ->  Parallel Seq Scan on t1  (cost=0.00..467803.85 rows=1 width=105) (actual time=23242.279..23242.280 rows=0.00 loops=3)
         Filter: (a > 1000)
         Rows Removed by Filter: 6666667
         Buffers: shared read=363637
         I/O Timings: shared read=64590.910
 Planning:
   Buffers: shared hit=54 read=14 dirtied=1
   I/O Timings: shared read=23.281
 Planning Time: 38.734 ms
 Execution Time: 23269.677 ms
(15 rows)

Time: 23347.716 ms (00:23.348)

--
-- * use_cursor = true (Default) EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
                                                     QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40) (actual time=20074.746..20074.796 rows=0.00 loops=1)
 Planning:
   Buffers: shared hit=33 read=17 dirtied=1
   I/O Timings: shared read=10.696
 Planning Time: 43.852 ms
 Execution Time: 20090.782 ms
(6 rows)

Time: 20169.081 ms (00:20.169)

--> From Log Files
[153045]: line=11 sid=6923fba6.255d5 tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
[153045]: line=12 sid=6923fba6.255d5 tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[153045]: line=13 sid=6923fba6.255d5 tag=DECLARE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: execute <unnamed>: DECLARE c1 CURSOR FOR
        SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
[153045]: line=14 sid=6923fba6.255d5 tag=DECLARE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_execute_message, postgres.c:2245
[153045]: line=15 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: FETCH 100 FROM c1
[153045]: line=16 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[153044]: line=5 sid=6923fba4.255d4 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOG:  00000: duration: 20074.799 ms  plan:
        Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
        Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40) (actual time=20074.746..20074.796 rows=0.00 loops=1)
[153044]: line=6 sid=6923fba4.255d4 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOCATION:  explain_ExecutorEnd, auto_explain.c:437
[153045]: line=17 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: CLOSE c1
[153045]: line=18 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[153045]: line=19 sid=6923fba6.255d5 tag=CLOSE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: duration: 20057.543 ms  plan:
        Query Text: DECLARE c1 CURSOR FOR
        SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
        Seq Scan on t1  (cost=0.00..613637.45 rows=1 width=105) (actual time=20057.541..20057.541 rows=0.00 loops=1)
          Filter: (a > 1000)
          Rows Removed by Filter: 20000000
[153045]: line=20 sid=6923fba6.255d5 tag=CLOSE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  explain_ExecutorEnd, auto_explain.c:437
[153045]: line=21 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: COMMIT TRANSACTION

--
-- * use_cursor = false EXPLAIN results
--

testdb=# SET postgres_fdw.use_cursor = false;

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
                                                     QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40) (actual time=24080.945..24080.956 rows=0.00 loops=1)
 Planning:
   Buffers: shared hit=33 read=17
   I/O Timings: shared read=30.074
 Planning Time: 53.678 ms
 Execution Time: 24103.994 ms
(6 rows)

Time: 24230.548 ms (00:24.231)

--> From Log Files
[153121]: line=11 sid=6923fc16.25621 tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
[153121]: line=12 sid=6923fc16.25621 tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[153121]: line=13 sid=6923fc16.25621 tag=SELECT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: execute <unnamed>: SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
[153121]: line=14 sid=6923fc16.25621 tag=SELECT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_execute_message, postgres.c:2245
[153113]: line=7 sid=6923fc0c.25619 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOG:  00000: duration: 24080.958 ms  plan:
        Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
        Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40) (actual time=24080.945..24080.956 rows=0.00 loops=1)
[153113]: line=8 sid=6923fc0c.25619 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOCATION:  explain_ExecutorEnd, auto_explain.c:437
[153121]: line=15 sid=6923fc16.25621 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: COMMIT TRANSACTION
[153121]: line=16 sid=6923fc16.25621 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[153121]: line=17 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: duration: 24059.372 ms  plan:
        Query Text: SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
        Gather  (cost=0.00..467803.85 rows=1 width=105) (actual time=24058.076..24059.367 rows=0.00 loops=1)
          Workers Planned: 2
          Workers Launched: 2
          ->  Parallel Seq Scan on t1  (cost=0.00..467803.85 rows=1 width=105) (actual time=24051.406..24051.407 rows=0.00 loops=3)
                Filter: (a > 1000)
                Rows Removed by Filter: 6666667
[153121]: line=18 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  explain_ExecutorEnd, auto_explain.c:437

---

Kenan YILMAZ



On Tue, 25 Nov 2025 at 17:01, Rafia Sabih <rafia.pghackers@gmail.com> wrote:


 Execution Time: 389.358 ms
(17 rows)

So even in the case without a parallel plan, it performs significantly better. I investigated a bit more to find out why the query was so slow with cursors,
and came to understand that the cursor is repeatedly abandoned and recreated via the code path of postgresReScanForeignScan.
So it looks like the cursor management itself accounts for this much extra time.
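Based on the statements visible in the server logs elsewhere in this thread, the remote-side traffic for each rescan of the inner foreign table in cursor mode would look roughly like this (a sketch only; the cursor name and remote query are illustrative, not copied from an actual log of this run):

```sql
-- Approximate round-trip pattern per rescan of the inner foreign scan
-- in cursor mode (illustrative cursor name and remote query):
CLOSE c2;
DECLARE c2 CURSOR FOR
    SELECT c1 FROM "S 1"."T 1";
FETCH 100 FROM c2;
```

With roughly 990k outer rows driving the inner foreign scan, repeating this close/declare/fetch sequence once per rescan plausibly dominates the runtime.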

To understand the impact of the patch in case of tuples spilled to disk, I tried the following example:
contrib_regression=# SET postgres_fdw.use_cursor = true;
SET
contrib_regression=# set enable_hashjoin = off;
SET
contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1 b where a.c1 = b.c1 order by a.c2;
                                                               QUERY PLAN                                                              
----------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=831.49..833.99 rows=1000 width=94) (actual time=4537.437..4598.483 rows=990821.00 loops=1)
   Sort Key: a.c2
   Sort Method: external merge  Disk: 137768kB
   Buffers: temp read=83156 written=83253
   ->  Merge Join  (cost=761.66..781.66 rows=1000 width=94) (actual time=3748.488..4090.547 rows=990821.00 loops=1)
         Merge Cond: (a.c1 = b.c1)
         Buffers: temp read=48725 written=48779
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1818.521..1865.792 rows=990821.00 loops=1)
               Sort Key: a.c1
               Sort Method: external merge  Disk: 75664kB
               Buffers: temp read=18910 written=18937
               ->  Foreign Scan on ft1 a  (cost=100.00..331.00 rows=1000 width=47) (actual time=1.302..1568.640 rows=990821.00 loops=1)
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1929.955..1981.104 rows=990821.00 loops=1)
               Sort Key: b.c1
               Sort Method: external sort  Disk: 79520kB
               Buffers: temp read=29815 written=29842
               ->  Foreign Scan on ft1 b  (cost=100.00..331.00 rows=1000 width=47) (actual time=0.528..1553.249 rows=990821.00 loops=1)
 Planning Time: 0.479 ms
 Execution Time: 4661.872 ms
(19 rows)

contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1 b where a.c1 = b.c1 order by a.c2;
                                                              QUERY PLAN                                                              
---------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=831.49..833.99 rows=1000 width=94) (actual time=3376.385..3435.406 rows=990821.00 loops=1)
   Sort Key: a.c2
   Sort Method: external merge  Disk: 137768kB
   Buffers: temp read=83156 written=83253
   ->  Merge Join  (cost=761.66..781.66 rows=1000 width=94) (actual time=2565.517..2916.814 rows=990821.00 loops=1)
         Merge Cond: (a.c1 = b.c1)
         Buffers: temp read=48725 written=48779
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1249.517..1300.132 rows=990821.00 loops=1)
               Sort Key: a.c1
               Sort Method: external merge  Disk: 75664kB
               Buffers: temp read=18910 written=18937
               ->  Foreign Scan on ft1 a  (cost=100.00..331.00 rows=1000 width=47) (actual time=1.651..980.740 rows=990821.00 loops=1)
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1315.990..1369.576 rows=990821.00 loops=1)
               Sort Key: b.c1
               Sort Method: external sort  Disk: 79520kB
               Buffers: temp read=29815 written=29842
               ->  Foreign Scan on ft1 b  (cost=100.00..331.00 rows=1000 width=47) (actual time=0.426..970.728 rows=990821.00 loops=1)
 Planning Time: 0.491 ms
 Execution Time: 3527.457 ms
(19 rows)

So it doesn't hurt in this case either. But that is because it is not the tuplestore added by this patch that is spilling to disk here.
I am working on finding a case where there are two or more active cursors, so that when the tuples of one cursor are stored,
they are spilled to disk. That should give a picture of the worst-case scenario for this patch.
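Assuming a nested loop keeps the outer foreign scan open while the inner foreign scan uses the same connection, one way such a worst case might be constructed is the following (an untested sketch, using the ft1/ft2 tables from the regression setup):

```sql
SET postgres_fdw.use_cursor = false;
SET enable_hashjoin = off;
SET enable_mergejoin = off;
-- Force a nested loop between two foreign scans: whenever the inner scan
-- needs the connection, the outer scan's pending rows have to be drained
-- into its tuplestore, which should spill to disk for a large outer table.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM ft1 a JOIN ft2 b ON (a.c1 = b.c1);
```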

Also, thanks to Kenan, a fellow hacker who finds this work interesting and offered to do some performance analysis for this patch;
maybe he can also post more results here.

--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH

Re: Bypassing cursors in postgres_fdw to enable parallel plans

From
Rafia Sabih
Date:
On Tue, 25 Nov 2025 at 15:24, KENAN YILMAZ <kenan.yilmaz@localus.com.tr> wrote:
Hello Hackers,

I have run quick tests with use_cursor = true/false.

The test table 't1' has 20,000,000 rows (~2.8 GB in size).

The test query is `EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;`

The summarized results are;

** `postgres_fdw.use_cursor = true` --> 20090.782 ms
** `postgres_fdw.use_cursor = false` --> 24103.994 ms

If you wish to run more advanced and complex test queries, feel free to do so.

All execution test scenarios are listed below;

--
-- restart with flush caches;
$ pg_ctl -D $PGDATA stop && sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches && rm $PGDATA/log/*.log && pg_ctl -D $PGDATA start
-- Data prep stage
DROP DATABASE IF EXISTS testdb;
CREATE DATABASE testdb;
\c testdb;
DROP TABLE IF EXISTS t1 CASCADE;
CREATE UNLOGGED TABLE t1 (a int, b int, c text, d timestamp);

-- Insert test datas
INSERT INTO t1 SELECT 10 + mod(i, 30), i, md5(i::text) || md5((i+1000000)::text) || md5((i+2000000)::text), '2025-01-01 10:00:00'::timestamp + (random() * 31536000) * INTERVAL '1 second' FROM generate_series(1, 20000000) i;

-- Table maintenance
ALTER TABLE t1 SET LOGGED;
VACUUM (ANALYZE, FULL) t1;

-- FDW prep stage
DROP EXTENSION IF EXISTS postgres_fdw CASCADE;
CREATE EXTENSION postgres_fdw;
CREATE SERVER foreign_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '127.0.0.1', port '5432', dbname 'testdb');
CREATE USER MAPPING FOR postgres SERVER foreign_server OPTIONS (user 'postgres');
CREATE FOREIGN TABLE t1fdw (a int, b int, c text, d timestamp) SERVER foreign_server OPTIONS (table_name 't1');

-- restart with flush caches are applied before all stage executions.

--
-- * Local t1 table EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1 WHERE a > 1000;
                                                         QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Gather  (cost=0.00..467803.85 rows=1 width=105) (actual time=23260.306..23261.591 rows=0.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared read=363637
   I/O Timings: shared read=64590.910
   ->  Parallel Seq Scan on t1  (cost=0.00..467803.85 rows=1 width=105) (actual time=23242.279..23242.280 rows=0.00 loops=3)
         Filter: (a > 1000)
         Rows Removed by Filter: 6666667
         Buffers: shared read=363637
         I/O Timings: shared read=64590.910
 Planning:
   Buffers: shared hit=54 read=14 dirtied=1
   I/O Timings: shared read=23.281
 Planning Time: 38.734 ms
 Execution Time: 23269.677 ms
(15 rows)

Time: 23347.716 ms (00:23.348)

--
-- * use_cursor = true (Default) EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
                                                     QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40) (actual time=20074.746..20074.796 rows=0.00 loops=1)
 Planning:
   Buffers: shared hit=33 read=17 dirtied=1
   I/O Timings: shared read=10.696
 Planning Time: 43.852 ms
 Execution Time: 20090.782 ms
(6 rows)

Time: 20169.081 ms (00:20.169)

--> From Log Files
[153045]: line=11 sid=6923fba6.255d5 tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
[153045]: line=12 sid=6923fba6.255d5 tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[153045]: line=13 sid=6923fba6.255d5 tag=DECLARE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: execute <unnamed>: DECLARE c1 CURSOR FOR
        SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
[153045]: line=14 sid=6923fba6.255d5 tag=DECLARE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_execute_message, postgres.c:2245
[153045]: line=15 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: FETCH 100 FROM c1
[153045]: line=16 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[153044]: line=5 sid=6923fba4.255d4 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOG:  00000: duration: 20074.799 ms  plan:
        Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
        Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40) (actual time=20074.746..20074.796 rows=0.00 loops=1)
[153044]: line=6 sid=6923fba4.255d4 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOCATION:  explain_ExecutorEnd, auto_explain.c:437
[153045]: line=17 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: CLOSE c1
[153045]: line=18 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[153045]: line=19 sid=6923fba6.255d5 tag=CLOSE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: duration: 20057.543 ms  plan:
        Query Text: DECLARE c1 CURSOR FOR
        SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
        Seq Scan on t1  (cost=0.00..613637.45 rows=1 width=105) (actual time=20057.541..20057.541 rows=0.00 loops=1)
          Filter: (a > 1000)
          Rows Removed by Filter: 20000000
[153045]: line=20 sid=6923fba6.255d5 tag=CLOSE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  explain_ExecutorEnd, auto_explain.c:437
[153045]: line=21 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: COMMIT TRANSACTION

--
-- * use_cursor = false EXPLAIN results
--

testdb=# SET postgres_fdw.use_cursor = false;

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
                                                     QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40) (actual time=24080.945..24080.956 rows=0.00 loops=1)
 Planning:
   Buffers: shared hit=33 read=17
   I/O Timings: shared read=30.074
 Planning Time: 53.678 ms
 Execution Time: 24103.994 ms
(6 rows)

Time: 24230.548 ms (00:24.231)

--> From Log Files
[153121]: line=11 sid=6923fc16.25621 tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
[153121]: line=12 sid=6923fc16.25621 tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[153121]: line=13 sid=6923fc16.25621 tag=SELECT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: execute <unnamed>: SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
[153121]: line=14 sid=6923fc16.25621 tag=SELECT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_execute_message, postgres.c:2245
[153113]: line=7 sid=6923fc0c.25619 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOG:  00000: duration: 24080.958 ms  plan:
        Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
        Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40) (actual time=24080.945..24080.956 rows=0.00 loops=1)
[153113]: line=8 sid=6923fc0c.25619 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOCATION:  explain_ExecutorEnd, auto_explain.c:437
[153121]: line=15 sid=6923fc16.25621 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: COMMIT TRANSACTION
[153121]: line=16 sid=6923fc16.25621 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[153121]: line=17 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: duration: 24059.372 ms  plan:
        Query Text: SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
        Gather  (cost=0.00..467803.85 rows=1 width=105) (actual time=24058.076..24059.367 rows=0.00 loops=1)
          Workers Planned: 2
          Workers Launched: 2
          ->  Parallel Seq Scan on t1  (cost=0.00..467803.85 rows=1 width=105) (actual time=24051.406..24051.407 rows=0.00 loops=3)
                Filter: (a > 1000)
                Rows Removed by Filter: 6666667
[153121]: line=18 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  explain_ExecutorEnd, auto_explain.c:437

---

Kenan YILMAZ
Thanks, Kenan, for these. So it looks like the patch performs about the same as the local scan case. I wonder if you found any case of performance degradation with the patch.
  
Per an off-list discussion with Robert, he suggested using the existing data structures for recording the state of the last queries instead of inventing something new.
That makes sense, so I reworked the patch to include a tuplestore in PgFdwScanState and then use PgFdwScanState as part of PgFdwConnState to keep track of previously
active cursors. Nothing else has changed in this version of the patch.



Rafia Sabih <rafia.pghackers@gmail.com> wrote on Tue, 25 Nov 2025 at 17:01:


On Fri, 10 Oct 2025 at 22:02, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Sep 29, 2025 at 10:51 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> > I am back at this work with a rebased and revised patch. The new version is rebased and has a change in approach.
> > Whenever we are using non-cursor mode, for the first cursor we are always saving the tuples
> > in the tuplestore, this is because we do not have any means to know beforehand how many cursors are required for the query.
>
> This might have the advantage of being simpler, but it's definitely
> worse. If we're only fetching one result set, which will be common,
> we'll buffer the whole thing in a tuplestore where that could
> otherwise be avoided. Maybe it's still best to go with this approach;
> not sure.
>
> > And when we switch to the next query then we do not have a way to fetch the tuples for the previous query.
> > So, the tuples retrieved earlier for the first query were lost if not saved.
> > I would highly appreciate your time and feedback for this.
>
> My suggestions are to work on the following areas:
>
> 1. Automated testing. The patch has no regression tests, and won't get
> committed without those.
>
> 2. Manual testing. How does the performance with this new option
> compare to the existing method? The answer might be different for
> small result sets that fit in memory and large ones that spill to
> disk, and will certainly also depend on how productive parallel query
> can be on the remote side; but I think you want to do and post on this
> list some measurements showing the best and worst case for the patch.
>
> 3. Patch clean-up. There are plenty of typos and whitespace-only
> changes in the patch. It's best to clean those up. Running pgindent is
> a good idea too. Some places could use comments.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com

Thank you, Robert, for reviewing this patch. On going through the patch more, I realised it was not equipped to handle cases where there are more than two active cursors. To accommodate such cases, I have now turned the new struct for saving the previous query into a list of such structs. Also, it turns out we need not save the tuples while a cursor is still active, so we populate the associated tuplestore only when we need to create a new cursor while the old cursor is not yet completely done.

In this version I have added the regression tests as well. I wanted to test this patch against all the existing postgres_fdw cases, and the only way I could figure out to do this was to run the SELECT statements with the new GUC.

I also did some tests for performance. I used the contrib_regression database and populated the table "S 1"."T 1" with more tuples to understand the impact of the patch at a larger scale. I also used auto_explain to get the foreign plans.

contrib_regression=# select count(*) from ft1;
2025-11-14 14:40:35.825 CET [44338] LOG:  duration: 61.336 ms  plan:
Query Text: select count(*) from ft1;
Foreign Scan  (cost=102.50..123.72 rows=1 width=8)
  Relations: Aggregate on (ft1)
2025-11-14 14:40:35.825 CET [44862] LOG:  duration: 60.575 ms  plan:
Query Text: DECLARE c1 CURSOR FOR
SELECT count(*) FROM "S 1"."T 1"
Aggregate  (cost=21888.22..21888.23 rows=1 width=8)
  ->  Seq Scan on "T 1"  (cost=0.00..19956.98 rows=772498 width=0)
 count  
--------
 990821
(1 row)

Time: 62.728 ms


--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH
Attachments

Re: Bypassing cursors in postgres_fdw to enable parallel plans

From
KENAN YILMAZ
Date:
Hello Hackers,

I have run another set of quick tests based on the v4-0001-Fetch-without-cursors patch.
I also recompiled PostgreSQL from the latest main branch.

Here are the execution time results;

** Local t1 table select (1st run) → 35049.790 ms — Parallel workers launched
** postgres_fdw.use_cursor = true → 18996.236 ms — No parallel workers launched
** postgres_fdw.use_cursor = false → 24962.529 ms — Parallel workers launched
** Local t1 table select (2nd run) → 21406.631 ms — Parallel workers launched

In my opinion, the primary goal is to enable FDW to fully support parallel execution.
From that perspective, these results seem acceptable.

If you would like to run more advanced or complex test queries, please feel free to proceed.

The following tests were executed in the same manner as in my previous test runs;

--
-- * Local t1 table EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1 WHERE a > 1000;
                                                         QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Gather  (cost=0.00..467803.85 rows=1 width=105) (actual time=35042.394..35043.542 rows=0.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=240 read=363397
   I/O Timings: shared read=99749.585
   ->  Parallel Seq Scan on t1  (cost=0.00..467803.85 rows=1 width=105) (actual time=35023.992..35023.993 rows=0.00 loops=3)
         Filter: (a > 1000)
         Rows Removed by Filter: 6666667
         Buffers: shared hit=240 read=363397
         I/O Timings: shared read=99749.585
 Planning:
   Buffers: shared hit=64 read=9 dirtied=1
   I/O Timings: shared read=4.656
 Planning Time: 30.145 ms
 Execution Time: 35049.790 ms
(15 rows)

Time: 35154.644 ms (00:35.155)

--
-- * use_cursor = true EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
                                                     QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40) (actual time=18981.860..18981.896 rows=0.00 loops=1)
 Planning:
   Buffers: shared hit=33 read=17 dirtied=1
   I/O Timings: shared read=27.012
 Planning Time: 65.290 ms
 Execution Time: 18996.236 ms
(6 rows)

Time: 19121.351 ms (00:19.121)

--> From Log Files
[197885]: line=11 sid=6929af4a.304fd tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
[197885]: line=12 sid=6929af4a.304fd tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[197885]: line=13 sid=6929af4a.304fd tag=DECLARE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: execute <unnamed>: DECLARE c1 CURSOR FOR
        SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
[197885]: line=14 sid=6929af4a.304fd tag=DECLARE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_execute_message, postgres.c:2245
[197885]: line=15 sid=6929af4a.304fd tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: FETCH 100 FROM c1
[197885]: line=16 sid=6929af4a.304fd tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[197882]: line=5 sid=6929af48.304fa tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOG:  00000: duration: 18981.899 ms  plan:
        Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
        Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40) (actual time=18981.860..18981.896 rows=0.00 loops=1)
[197882]: line=6 sid=6929af48.304fa tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOCATION:  explain_ExecutorEnd, auto_explain.c:437
[197885]: line=17 sid=6929af4a.304fd tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: CLOSE c1
[197885]: line=18 sid=6929af4a.304fd tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[197885]: line=19 sid=6929af4a.304fd tag=CLOSE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: duration: 18950.616 ms  plan:
        Query Text: DECLARE c1 CURSOR FOR
        SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
        Seq Scan on t1  (cost=0.00..613637.45 rows=1 width=105) (actual time=18950.614..18950.614 rows=0.00 loops=1)
          Filter: (a > 1000)
          Rows Removed by Filter: 20000000
[197885]: line=20 sid=6929af4a.304fd tag=CLOSE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  explain_ExecutorEnd, auto_explain.c:437
[197885]: line=21 sid=6929af4a.304fd tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: COMMIT TRANSACTION
[197885]: line=22 sid=6929af4a.304fd tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078

--
-- * use_cursor = false EXPLAIN results
--

testdb=# SET postgres_fdw.use_cursor = false;

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
                                                     QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40) (actual time=24940.034..24940.054 rows=0.00 loops=1)
 Planning:
   Buffers: shared hit=33 read=17
   I/O Timings: shared read=34.719
 Planning Time: 62.300 ms
 Execution Time: 24962.529 ms
(6 rows)

Time: 25124.566 ms (00:25.125)

--> From Log Files
[197919]: line=11 sid=6929afcb.3051f tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
[197919]: line=12 sid=6929afcb.3051f tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[197919]: line=13 sid=6929afcb.3051f tag=SELECT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: execute <unnamed>: SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
[197919]: line=14 sid=6929afcb.3051f tag=SELECT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_execute_message, postgres.c:2245
[197917]: line=7 sid=6929afbd.3051d tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOG:  00000: duration: 24940.057 ms  plan:
        Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
        Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40) (actual time=24940.034..24940.054 rows=0.00 loops=1)
[197917]: line=8 sid=6929afbd.3051d tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOCATION:  explain_ExecutorEnd, auto_explain.c:437
[197919]: line=15 sid=6929afcb.3051f tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: COMMIT TRANSACTION
[197919]: line=16 sid=6929afcb.3051f tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[197919]: line=17 sid=6929afcb.3051f tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: duration: 24908.608 ms  plan:
        Query Text: SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
        Gather  (cost=0.00..467803.85 rows=1 width=105) (actual time=24906.370..24908.603 rows=0.00 loops=1)
          Workers Planned: 2
          Workers Launched: 2
          ->  Parallel Seq Scan on t1  (cost=0.00..467803.85 rows=1 width=105) (actual time=24895.314..24895.314 rows=0.00 loops=3)
                Filter: (a > 1000)
                Rows Removed by Filter: 6666667
[197919]: line=18 sid=6929afcb.3051f tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  explain_ExecutorEnd, auto_explain.c:437

---
--
-- * Local t1 table EXPLAIN results (again)
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1 WHERE a > 1000;
                                                         QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Gather  (cost=0.00..467803.67 rows=1 width=105) (actual time=21403.492..21405.381 rows=0.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared read=363637
   I/O Timings: shared read=59425.134
   ->  Parallel Seq Scan on t1  (cost=0.00..467803.67 rows=1 width=105) (actual time=21394.997..21394.997 rows=0.00 loops=3)
         Filter: (a > 1000)
         Rows Removed by Filter: 6666667
         Buffers: shared read=363637
         I/O Timings: shared read=59425.134
 Planning:
   Buffers: shared hit=54 read=14
   I/O Timings: shared read=30.482
 Planning Time: 37.955 ms
 Execution Time: 21406.631 ms
(15 rows)

Time: 21506.079 ms (00:21.506)

---
Kenan YILMAZ


On Thu, 27 Nov 2025 at 13:50, Rafia Sabih <rafia.pghackers@gmail.com> wrote:
On Tue, 25 Nov 2025 at 15:24, KENAN YILMAZ <kenan.yilmaz@localus.com.tr> wrote:
Hello Hackers,

I have run some quick tests with use_cursor = true/false.

The test table 't1' has 20,000,000 rows (~2.8 GB in size).

The test query is `EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;`

The summarized results are:

** `postgres_fdw.use_cursor = true` --> 20090.782 ms
** `postgres_fdw.use_cursor = false` --> 24103.994 ms

If you wish to run more advanced and complex test queries, feel free to do so.

All execution test scenarios are listed below:

--
-- restart with flush caches;
$ pg_ctl -D $PGDATA stop && sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches && rm $PGDATA/log/*.log && pg_ctl -D $PGDATA start
-- Data prep stage
DROP DATABASE IF EXISTS testdb;
CREATE DATABASE testdb;
\c testdb;
DROP TABLE IF EXISTS t1 CASCADE;
CREATE UNLOGGED TABLE t1 (a int, b int, c text, d timestamp);

-- Insert test data
INSERT INTO t1 SELECT 10 + mod(i, 30), i, md5(i::text) || md5((i+1000000)::text) || md5((i+2000000)::text), '2025-01-01 10:00:00'::timestamp + (random() * 31536000) * INTERVAL '1 second' FROM generate_series(1, 20000000) i;

-- Table maintenance
ALTER TABLE t1 SET LOGGED;
VACUUM (ANALYZE, FULL) t1;

-- FDW prep stage
DROP EXTENSION IF EXISTS postgres_fdw CASCADE;
CREATE EXTENSION postgres_fdw;
CREATE SERVER foreign_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '127.0.0.1', port '5432', dbname 'testdb');
CREATE USER MAPPING FOR postgres SERVER foreign_server OPTIONS (user 'postgres');
CREATE FOREIGN TABLE t1fdw (a int, b int, c text, d timestamp) SERVER foreign_server OPTIONS (table_name 't1');

-- restart with flush caches are applied before all stage executions.

--
-- * Local t1 table EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1 WHERE a > 1000;
                                                         QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Gather  (cost=0.00..467803.85 rows=1 width=105) (actual time=23260.306..23261.591 rows=0.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared read=363637
   I/O Timings: shared read=64590.910
   ->  Parallel Seq Scan on t1  (cost=0.00..467803.85 rows=1 width=105) (actual time=23242.279..23242.280 rows=0.00 loops=3)
         Filter: (a > 1000)
         Rows Removed by Filter: 6666667
         Buffers: shared read=363637
         I/O Timings: shared read=64590.910
 Planning:
   Buffers: shared hit=54 read=14 dirtied=1
   I/O Timings: shared read=23.281
 Planning Time: 38.734 ms
 Execution Time: 23269.677 ms
(15 rows)

Time: 23347.716 ms (00:23.348)

--
-- * use_cursor = true (Default) EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
                                                     QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40) (actual time=20074.746..20074.796 rows=0.00 loops=1)
 Planning:
   Buffers: shared hit=33 read=17 dirtied=1
   I/O Timings: shared read=10.696
 Planning Time: 43.852 ms
 Execution Time: 20090.782 ms
(6 rows)

Time: 20169.081 ms (00:20.169)

--> From Log Files
[153045]: line=11 sid=6923fba6.255d5 tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
[153045]: line=12 sid=6923fba6.255d5 tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[153045]: line=13 sid=6923fba6.255d5 tag=DECLARE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: execute <unnamed>: DECLARE c1 CURSOR FOR
        SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
[153045]: line=14 sid=6923fba6.255d5 tag=DECLARE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_execute_message, postgres.c:2245
[153045]: line=15 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: FETCH 100 FROM c1
[153045]: line=16 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[153044]: line=5 sid=6923fba4.255d4 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOG:  00000: duration: 20074.799 ms  plan:
        Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
        Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40) (actual time=20074.746..20074.796 rows=0.00 loops=1)
[153044]: line=6 sid=6923fba4.255d4 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOCATION:  explain_ExecutorEnd, auto_explain.c:437
[153045]: line=17 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: CLOSE c1
[153045]: line=18 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[153045]: line=19 sid=6923fba6.255d5 tag=CLOSE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: duration: 20057.543 ms  plan:
        Query Text: DECLARE c1 CURSOR FOR
        SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
        Seq Scan on t1  (cost=0.00..613637.45 rows=1 width=105) (actual time=20057.541..20057.541 rows=0.00 loops=1)
          Filter: (a > 1000)
          Rows Removed by Filter: 20000000
[153045]: line=20 sid=6923fba6.255d5 tag=CLOSE CURSOR usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  explain_ExecutorEnd, auto_explain.c:437
[153045]: line=21 sid=6923fba6.255d5 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: COMMIT TRANSACTION

--
-- * use_cursor = false EXPLAIN results
--

testdb=# SET postgres_fdw.use_cursor = false;

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
                                                     QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40) (actual time=24080.945..24080.956 rows=0.00 loops=1)
 Planning:
   Buffers: shared hit=33 read=17
   I/O Timings: shared read=30.074
 Planning Time: 53.678 ms
 Execution Time: 24103.994 ms
(6 rows)

Time: 24230.548 ms (00:24.231)

--> From Log Files
[153121]: line=11 sid=6923fc16.25621 tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
[153121]: line=12 sid=6923fc16.25621 tag=idle usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[153121]: line=13 sid=6923fc16.25621 tag=SELECT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: execute <unnamed>: SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
[153121]: line=14 sid=6923fc16.25621 tag=SELECT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_execute_message, postgres.c:2245
[153113]: line=7 sid=6923fc0c.25619 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOG:  00000: duration: 24080.958 ms  plan:
        Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
        Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40) (actual time=24080.945..24080.956 rows=0.00 loops=1)
[153113]: line=8 sid=6923fc0c.25619 tag=EXPLAIN usr=postgres db=testdb app=psql client=[local]: LOCATION:  explain_ExecutorEnd, auto_explain.c:437
[153121]: line=15 sid=6923fc16.25621 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: COMMIT TRANSACTION
[153121]: line=16 sid=6923fc16.25621 tag=idle in transaction usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query, postgres.c:1078
[153121]: line=17 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: duration: 24059.372 ms  plan:
        Query Text: SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
        Gather  (cost=0.00..467803.85 rows=1 width=105) (actual time=24058.076..24059.367 rows=0.00 loops=1)
          Workers Planned: 2
          Workers Launched: 2
          ->  Parallel Seq Scan on t1  (cost=0.00..467803.85 rows=1 width=105) (actual time=24051.406..24051.407 rows=0.00 loops=3)
                Filter: (a > 1000)
                Rows Removed by Filter: 6666667
[153121]: line=18 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  explain_ExecutorEnd, auto_explain.c:437

---

Kenan YILMAZ
Thanks Kenan for these. So, it looks like the patch performs the same as in the local scan case. I wonder if you found any case of performance degradation with the patch.
  
Per an off-list discussion with Robert, he suggested using the existing data structures for recording the state of last queries instead of inventing something new.
Makes sense, so I reworked the patch to include tuplestore in PgFdwScanState and then use PgFdwScanState as part of PgFdwConnState to keep track of previously
active cursors. Nothing else is changed in this version of the patch.
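For what it's worth, the arrangement described above can be sketched in miniature. This is purely an illustrative model, not patch code: the struct names echo the patch's PgFdwScanState-inside-PgFdwConnState idea, but the tuplestore is reduced to a fixed int array and all PostgreSQL types are replaced with stand-ins.

```c
#include <stddef.h>

/* Miniature stand-ins for the patch's structures: the "tuplestore" is a
 * plain int array, and rows are ints rather than tuples. */
typedef struct ScanState
{
    int     pending[8];      /* rows still streaming on the connection */
    int     npending;
    int     buffered[8];     /* saved rows, readable after the switch */
    int     nbuffered;
} ScanState;

typedef struct ConnState
{
    ScanState *active_scan;  /* scan currently using the connection */
} ConnState;

/*
 * Before reusing the connection for a new scan, drain the old scan's
 * remaining rows into its buffer so they are not lost, then hand the
 * connection to the new scan.
 */
static void
conn_switch_scan(ConnState *conn, ScanState *newscan)
{
    ScanState *old = conn->active_scan;

    if (old != NULL)
    {
        for (int i = 0; i < old->npending; i++)
            old->buffered[old->nbuffered++] = old->pending[i];
        old->npending = 0;
    }
    conn->active_scan = newscan;
}
```

The point of the model is only that the previously active scan must be drained before the connection can be reused, which is what the tuplestore in the patch is for.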



On Tue, 25 Nov 2025 at 17:01, Rafia Sabih <rafia.pghackers@gmail.com> wrote:


On Fri, 10 Oct 2025 at 22:02, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Sep 29, 2025 at 10:51 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> > I am back at this work with a rebased and revised patch. The new version is rebased and has a change in approach.
> > Whenever we are using non-cursor mode, for the first cursor we are always saving the tuples
> > in the tuplestore, this is because we do not have any means to know beforehand how many cursors are required for the query.
>
> This might have the advantage of being simpler, but it's definitely
> worse. If we're only fetching one result set, which will be common,
> we'll buffer the whole thing in a tuplestore where that could
> otherwise be avoided. Maybe it's still best to go with this approach;
> not sure.
>
> > And when we switch to the next query then we do not have a way to fetch the tuples for the previous query.
> > So, the tuples retrieved earlier for the first query were lost if not saved.
> > I would highly appreciate your time and feedback for this.
>
> My suggestions are to work on the following areas:
>
> 1. Automated testing. The patch has no regression tests, and won't get
> committed without those.
>
> 2. Manual testing. How does the performance with this new option
> compare to the existing method? The answer might be different for
> small result sets that fit in memory and large ones that spill to
> disk, and will certainly also depend on how productive parallel query
> can be on the remote side; but I think you want to do and post on this
> list some measurements showing the best and worst case for the patch.
>
> 3. Patch clean-up. There are plenty of typos and whitespace-only
> changes in the patch. It's best to clean those up. Running pgindent is
> a good idea too. Some places could use comments.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com

Thank you Robert, for reviewing this patch. On going through the patch more, I realised it was not equipped to handle cases with more than two active cursors. To accommodate such cases, I have now modified the new struct for saving the previous query into a list of such structs. It also turns out we need not save the tuples for an active cursor, so we populate the associated tuplestore only when we need to create a new cursor while the old cursor is not yet completely done.

In this version I have added the regression tests as well. I wanted to test this patch against all the existing postgres_fdw cases; the only way I could figure out to do this was to rerun the SELECT statements with the new GUC.

I also did some tests for performance. I used the contrib_regression database and populated the table "S 1"."T 1" with more tuples to understand the impact of the patch at a higher scale. I also used auto_explain to get the foreign plans.

contrib_regression=# select count(*) from ft1;
2025-11-14 14:40:35.825 CET [44338] LOG:  duration: 61.336 ms  plan:
Query Text: select count(*) from ft1;
Foreign Scan  (cost=102.50..123.72 rows=1 width=8)
  Relations: Aggregate on (ft1)
2025-11-14 14:40:35.825 CET [44862] LOG:  duration: 60.575 ms  plan:
Query Text: DECLARE c1 CURSOR FOR
SELECT count(*) FROM "S 1"."T 1"
Aggregate  (cost=21888.22..21888.23 rows=1 width=8)
  ->  Seq Scan on "T 1"  (cost=0.00..19956.98 rows=772498 width=0)
 count  
--------
 990821
(1 row)

Time: 62.728 ms
contrib_regression=# SET postgres_fdw.use_cursor = false; 
SET
Time: 1.515 ms
contrib_regression=# select count(*) from ft1;
2025-11-14 14:40:46.260 CET [44862] LOG:  duration: 21.875 ms  plan:
Query Text: SELECT count(*) FROM "S 1"."T 1"
Finalize Aggregate  (cost=17255.64..17255.65 rows=1 width=8)
  ->  Gather  (cost=17255.43..17255.64 rows=2 width=8)
        Workers Planned: 2
        ->  Partial Aggregate  (cost=16255.43..16255.44 rows=1 width=8)
              ->  Parallel Seq Scan on "T 1"  (cost=0.00..15450.74 rows=321874 width=0)
2025-11-14 14:40:46.260 CET [44338] LOG:  duration: 22.623 ms  plan:
Query Text: select count(*) from ft1;
Foreign Scan  (cost=102.50..123.72 rows=1 width=8)
  Relations: Aggregate on (ft1)
 count  
--------
 990821
(1 row)

Time: 24.862 ms

So for this query, the advantage comes from the parallel plan, which was otherwise not possible in this scenario.
To study the performance of this patch further, I found another interesting query; here are the results:

contrib_regression=# SET postgres_fdw.use_cursor = true;
SET
contrib_regression=# explain  (analyse, buffers)  SELECT t1."C 1" FROM "S 1"."T 1" t1 left join ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1."C 1");
                                                              QUERY PLAN                                                              
---------------------------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=109013.50..131877.35 rows=772498 width=4) (actual time=112311.578..112804.516 rows=990821.00 loops=1)
   Hash Cond: (t1."C 1" = t3.c1)
   Buffers: shared hit=12232, temp read=12754 written=12754
   ->  Seq Scan on "T 1" t1  (cost=0.00..19956.98 rows=772498 width=4) (actual time=0.039..48.808 rows=990821.00 loops=1)
         Buffers: shared hit=12232
   ->  Hash  (cost=109001.00..109001.00 rows=1000 width=4) (actual time=112310.386..112310.387 rows=990821.00 loops=1)
         Buckets: 262144 (originally 1024)  Batches: 8 (originally 1)  Memory Usage: 6408kB
         Buffers: temp written=2537
         ->  Nested Loop  (cost=200.43..109001.00 rows=1000 width=4) (actual time=0.728..112030.241 rows=990821.00 loops=1)
               ->  Foreign Scan on ft1 t2  (cost=100.00..331.00 rows=1000 width=4) (actual time=0.398..710.505 rows=990821.00 loops=1)
               ->  Foreign Scan on ft2 t3  (cost=100.43..108.66 rows=1 width=4) (actual time=0.082..0.082 rows=1.00 loops=990821)
 Planning:
   Buffers: shared hit=5
 Planning Time: 2.211 ms
 Execution Time: 112825.428 ms
(15 rows)

contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
explain  (analyse, buffers)  SELECT t1."C 1" FROM "S 1"."T 1" t1 left join ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1."C 1");
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=109013.50..131877.35 rows=772498 width=4) (actual time=261.416..354.520 rows=990821.00 loops=1)
   Hash Cond: (t1."C 1" = t3.c1)
   Buffers: shared hit=12232, temp written=2660
   ->  Seq Scan on "T 1" t1  (cost=0.00..19956.98 rows=772498 width=4) (actual time=0.021..35.531 rows=990821.00 loops=1)
         Buffers: shared hit=12232
   ->  Hash  (cost=109001.00..109001.00 rows=1000 width=4) (actual time=261.381..261.383 rows=100.00 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         Buffers: temp written=2660
         ->  Nested Loop  (cost=200.43..109001.00 rows=1000 width=4) (actual time=255.563..261.356 rows=100.00 loops=1)
               Buffers: temp written=2660
               ->  Foreign Scan on ft1 t2  (cost=100.00..331.00 rows=1000 width=4) (actual time=0.433..0.443 rows=100.00 loops=1)
               ->  Foreign Scan on ft2 t3  (cost=100.43..108.66 rows=1 width=4) (actual time=2.609..2.609 rows=1.00 loops=100)
                     Buffers: temp written=2660
 Planning:
   Buffers: shared hit=5
 Planning Time: 2.284 ms
 Execution Time: 389.358 ms
(17 rows)

So even in the case without a parallel plan, it performs significantly better. I investigated a bit more to find out why the query was so slow with cursors,
and came to understand that it was repeatedly abandoning and recreating the cursor via the postgresReScanForeignScan code path.
So it looks like the cursor management itself accounts for the extra time.

To understand the impact of the patch in the case where tuples spill to disk, I tried the following example:
contrib_regression=# SET postgres_fdw.use_cursor = true;
SET
contrib_regression=# set enable_hashjoin = off;
SET
contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1 b where a.c1 = b.c1 order by a.c2;
                                                               QUERY PLAN                                                              
----------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=831.49..833.99 rows=1000 width=94) (actual time=4537.437..4598.483 rows=990821.00 loops=1)
   Sort Key: a.c2
   Sort Method: external merge  Disk: 137768kB
   Buffers: temp read=83156 written=83253
   ->  Merge Join  (cost=761.66..781.66 rows=1000 width=94) (actual time=3748.488..4090.547 rows=990821.00 loops=1)
         Merge Cond: (a.c1 = b.c1)
         Buffers: temp read=48725 written=48779
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1818.521..1865.792 rows=990821.00 loops=1)
               Sort Key: a.c1
               Sort Method: external merge  Disk: 75664kB
               Buffers: temp read=18910 written=18937
               ->  Foreign Scan on ft1 a  (cost=100.00..331.00 rows=1000 width=47) (actual time=1.302..1568.640 rows=990821.00 loops=1)
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1929.955..1981.104 rows=990821.00 loops=1)
               Sort Key: b.c1
               Sort Method: external sort  Disk: 79520kB
               Buffers: temp read=29815 written=29842
               ->  Foreign Scan on ft1 b  (cost=100.00..331.00 rows=1000 width=47) (actual time=0.528..1553.249 rows=990821.00 loops=1)
 Planning Time: 0.479 ms
 Execution Time: 4661.872 ms
(19 rows)

contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1 b where a.c1 = b.c1 order by a.c2;
                                                              QUERY PLAN                                                              
---------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=831.49..833.99 rows=1000 width=94) (actual time=3376.385..3435.406 rows=990821.00 loops=1)
   Sort Key: a.c2
   Sort Method: external merge  Disk: 137768kB
   Buffers: temp read=83156 written=83253
   ->  Merge Join  (cost=761.66..781.66 rows=1000 width=94) (actual time=2565.517..2916.814 rows=990821.00 loops=1)
         Merge Cond: (a.c1 = b.c1)
         Buffers: temp read=48725 written=48779
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1249.517..1300.132 rows=990821.00 loops=1)
               Sort Key: a.c1
               Sort Method: external merge  Disk: 75664kB
               Buffers: temp read=18910 written=18937
               ->  Foreign Scan on ft1 a  (cost=100.00..331.00 rows=1000 width=47) (actual time=1.651..980.740 rows=990821.00 loops=1)
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1315.990..1369.576 rows=990821.00 loops=1)
               Sort Key: b.c1
               Sort Method: external sort  Disk: 79520kB
               Buffers: temp read=29815 written=29842
               ->  Foreign Scan on ft1 b  (cost=100.00..331.00 rows=1000 width=47) (actual time=0.426..970.728 rows=990821.00 loops=1)
 Planning Time: 0.491 ms
 Execution Time: 3527.457 ms
(19 rows)

So it doesn't hurt in this case either. But that is because it is not the tuplestore added in this patch that is spilling to disk here.
I am working on finding a case where there are two or more active cursors and the tuples stored for one cursor spill to disk.
That might give a picture of the worst-case scenario for this patch.

Also, thanks to Kenan, a fellow hacker who finds this work interesting and offered to do some performance analysis for this patch;
maybe he can also post more results here.

--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH


--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH

Re: Bypassing cursors in postgres_fdw to enable parallel plans

От
Robert Haas
Дата:
On Thu, Nov 27, 2025 at 5:50 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> Thanks Kenan for these. So, it looks like the patch performs the same as in the local scan case. I wonder if you
> found any case of performance degradation with the patch.
>
> Per an off-list discussion with Robert, he suggested using the existing data structures for recording the state of
> last queries instead of inventing something new.
> Makes sense, so I reworked the patch to include tuplestore in PgFdwScanState and then use PgFdwScanState as part of
> PgFdwConnState to keep track of previously
> active cursors. Nothing else is changed in this version of the patch.

Thanks for your continued work on this topic.

As we discussed off-list, the regression tests in this patch need
quite a bit more work. We shouldn't just repeat large numbers of tests
-- we should repeat enough to provide meaningful test coverage of the
new feature, and maybe add a few new ones that specifically target
scenarios that the patch is intended to cover. One somewhat bizarre
thing that I noticed scrolling through the regression test outputs is
this:

+-- Test with non-cursor mode
+SET postgres_fdw.use_cursor = false;
+SET postgres_fdw.use_cursor = true;

This happens in multiple places in the regression tests, and doesn't
really make any sense to me, because changing the GUC from to false,
doing nothing, and then changing it back to true doesn't seem like a
useful test scenario, and definitely doesn't seem like a test scenario
that we need to repeat multiple times. Honestly, I wonder how this
happened. Did you maybe run a script over the .sql files to generate
the new versions and then not check what the output actually looked
like? I really can't stress enough the need to be thoughtful about
constructing test cases for a patch of this type.

A related problem is that you've included 12,721 lines of useless
output in the patch file, because somehow postgres_fdw.out.orig got
checked into the repository. It's always a good idea, when posting a
patch to the list, to check the diffstat to make sure that there's
nothing in there that you don't expect to see. The presence of this
line just below your commit message could have alerted you to this
problem:

 .../expected/postgres_fdw.out.orig            | 12721 ++++++++++++++++

Ideally, I really recommend scrolling through the patch, not just the
diffstat, to make sure that everything is the way you want it to be,
before giving it to others to look at.

There are a number of other cosmetic problems with this patch that
make it hard to review the actual code changes. For instance:

+ /* To be used only in non-cursor mode */
+ Tuplestorestate *tuplestore;
+ TupleTableSlot *slot;
+ bool tuples_ready;
+ struct PgFdwScanState *next;

The comment is good, but it only explains a general fact about all
four of these fields. It doesn't explain specifically what each of
these fields is intended to do. Naming a field very generally, like
"next", when it must have some very specific purpose that is not
documented, makes it really hard for someone reading the code -- in
this case, me -- to understand what the point is. So, these fields
should probably have comments, but also, some of them probably need to
be renamed. Maybe "tuplestore" and "slot" are OK but "next" is not
going to be good enough.

+/* Fill the Tupleslot when another query needs to execute. */
+static void
+fillTupleSlot(PgFdwScanState *fsstate, ForeignScanState *node)

I think you're filling a tuple store, not a tuple slot.

+ initStringInfo(&buf);

What seems to happen here is that you create an empty buffer, add
fsstate->query to it and nothing else, and then use buf.data. So why
not just get rid of buf and use fsstate->query directly?
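The simplification described here can be modeled in plain C. This is a hedged sketch: the StringInfo machinery and the eventual consumer of the query string are replaced with standard-library stand-ins, only to show that the intermediate buffer is a pure copy that changes nothing.

```c
#include <stdio.h>
#include <string.h>

/* Stand-in for the StringInfo detour: copy the query into a scratch
 * buffer and hand back the buffer's contents... */
static const char *
with_buffer(const char *query, char *scratch, size_t len)
{
    snprintf(scratch, len, "%s", query);
    return scratch;
}

/* ...versus just passing the original string along unchanged. */
static const char *
without_buffer(const char *query)
{
    return query;
}
```

Both paths yield the same bytes, so the second is strictly simpler.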

+ /* Always fetch the last prev_query */
+ for (; p1->next != NULL; p1 = p1->next);

There are multiple problems here. First, this code is not per
PostgreSQL style and would be reindented by pgindent, which you should
make a habit of running before posting. Second, p1 is not a
particularly informative or understandable variable name. Third, why
are we using a linked list if the value we're going to need is at the
end of the list? If we need to be able to access the last element of
the list efficiently, maybe we should be keeping the list in reverse
order, or maybe we should be using a List which permits efficient
random access.
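The reverse-order idea can be illustrated with a toy list (a sketch, not patch code; the int field stands in for a whole scan state): if each new entry is pushed at the head, the most recent entry is reachable without any traversal.

```c
#include <stddef.h>

typedef struct Node
{
    int         query_id;   /* stand-in for a PgFdwScanState */
    struct Node *next;
} Node;

/* Newest-first: push each new entry at the head... */
static Node *
push_newest(Node *head, Node *n)
{
    n->next = head;
    return n;
}

/* ...so the most recent entry is the head itself, O(1), with no
 * "for (; p->next != NULL; p = p->next);" walk to the tail. */
static Node *
newest(Node *head)
{
    return head;
}
```

Older entries remain reachable through the next pointers, but the common case needs no loop at all.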

But looking quickly through the patch, I have an even bigger question
here. It doesn't really seem like we ever care about any element of
the list other than the last. It looks like we go to a fair amount of
trouble to maintain the list at various points in the code, but it
seems like when we access the list, we just always go to the end. That
makes me wonder why we even need a list. Perhaps instead of
PgFdwConnState storing a pointer to "prev_queries", it should just
store a pointer to the PgFdwScanState for the query that is currently
"using" the connection, if there is one. That is, the query whose
result set is currently being received, and which must be buffered
into a tuplestore before using the connection for something else. When
we've done that, we reset the field to NULL, and we don't maintain any
list of the older "previous queries". Maybe there's some problem with
that idea, but that gets back to my earlier point about comments: the
comment for "next" (or however it got renamed) ought to be explaining
something about why it exists and what it's for. Without such a
comment, the only ways for me to be sure whether we really need "next"
is to either (a) ask you or (b) spend a long time staring at the code
to try to figure it out or (c) try removing it and see if it works.

Ah, wait! I just found some code that cares about the list but not
just about the last element:

+ for (; p1->next != NULL; p1 = p1->next)
+ /* find the correct prev_query */
+ {
+ if ((p1->tuples_ready && fsstate->cursor_number == p1->cursor_number))
+ cur_prev = p1;
+ }

But I'm still confused. Why do we need to iterate through the
prev_queries list to find the correct ForeignScanState? Isn't that
exactly what fsstate is here? I'm inclined to think that this is just
a complicated way of setting p1 = fsstate and then setting cur_prev =
p1, so we could skip all this and just test whether
fsstate->tuplestore != NULL and if so retrieve tuples from there. If
there's some reason why fsstate and cur_prev would end up being
different, then there should be some comment explaining it. I'm
inclined to think that would be a sign of a design flaw: I think the
idea here should be to make use of the ForeignScanState object that
already exists to hold the details that we need for this feature,
rather than creating a new one. But, if there were a comment telling
me why it's like this, I might change my mind.

+ /* Clear the last query details, once tuples are retrieved. */
+ if (fsstate->conn_state->prev_queries == cur_prev)
+ {
+ /*
+ * This is the first prev query in the list, check if there
+ * are more
+ */
+ if (fsstate->conn_state->num_prev_queries > 1)
+
+ /*
+ * fsstate->conn_state->prev_queries->next =
+ * cur_prev->next;
+ */
+ fsstate->next = cur_prev->next;
+ clear_prev_query_state(cur_prev);
+ }

This code can remove an item from the prev_queries list, but only if
it's the first item. Now, if my analysis above is correct and we don't
even need the prev_queries list, then it doesn't matter; all this
logic can basically go away. If we do need the prev_queries list, then
I don't think it can be right to only be able to remove the first
element of the list. There's no reason why the oldest prev_query has
to finish first. What if there are three queries in the list and the
middle one finishes first? Then this will just leave it in the list,
whereas if the first one had finished first, it would have been
deleted from the list. That kind of inconsistency doesn't seem like a
good idea.
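
For comparison, unlinking an arbitrary element from a singly linked list is not much more code than unlinking the head. A generic sketch (plain C with malloc/free standing in for whatever memory management the patch actually uses):

```c
#include <stdlib.h>

typedef struct Node
{
	int			cursor_number;
	struct Node *next;
} Node;

static Node *
list_push(Node *head, int cursor_number)
{
	Node	   *n = malloc(sizeof(Node));

	n->cursor_number = cursor_number;
	n->next = head;
	return n;
}

/*
 * Unlink the node matching cursor_number wherever it sits -- first,
 * middle, or last -- by walking a pointer-to-pointer through the links,
 * rather than special-casing the head.
 */
static Node *
list_remove(Node *head, int cursor_number)
{
	Node	  **link = &head;

	while (*link != NULL)
	{
		if ((*link)->cursor_number == cursor_number)
		{
			Node	   *dead = *link;

			*link = dead->next;
			free(dead);
			break;
		}
		link = &(*link)->next;
	}
	return head;
}
```

(With PostgreSQL's List type, list_delete_ptr or foreach_delete_current would do this bookkeeping for you.)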

+ /*
+ * This is to fetch all the tuples of this query and save
+ * them in Tuple Slot. Since it is using
+ * PQSetChunkedRowsMode, we only get the
+ * fsstate->fetch_size tuples in one run, so keep on
+ * executing till we get NULL in PGresult i.e. all the
+ * tuples are retrieved.
+ */
+ p1->tuplestore = tuplestore_begin_heap(false, true, work_mem);
+ p1->slot = MakeSingleTupleTableSlot(fsstate->tupdesc, &TTSOpsMinimalTuple);

The slot doesn't seem to be used anywhere in the code that follows. Is
there a reason to initialize it here, rather than wherever it's
needed? If so, maybe the comment could mention that.

+ i = 0;

I don't think this does anything, because you reinitialize i in the
inner loop where you use it. In general, try moving your variable
declarations to the innermost scope where they are needed. It makes
the code more clear and allows the compiler to tell you about stuff
like this.
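
A trivial example of what I mean: with the counter declared in the loop header (C99 style, which the tree now allows), a stray "i = 0;" left over from an earlier draft simply can't compile, because i doesn't exist outside the loop.

```c
/* The loop counter lives only inside the loop; a leftover outer
 * initialization of i would be a compile error, not dead code. */
static int
sum_first(int n)
{
	int			total = 0;

	for (int i = 0; i < n; i++)
		total += i;
	return total;
}
```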

+ else if (PQresultStatus(res) == PGRES_TUPLES_OK)
+ {
+ while (res != NULL)
+ res = pgfdw_get_result(conn);
+ break;
+ }

I doubt that this is right. It seems like it's willing to throw
away an infinite number of result sets without doing anything with
them, or discard an infinite number of errors without processing them,
but that doesn't seem likely to be the right thing.
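
The usual shape for this kind of drain loop is to consume every remaining result but hang on to the first error so it can be reported afterward. Here is a sketch of that pattern; note this uses a stubbed result stream standing in for PQgetResult (which returns results until NULL), not real libpq, so the names are illustrative:

```c
#include <stddef.h>

typedef enum
{
	RES_TUPLES_OK,
	RES_FATAL_ERROR,
	RES_DONE					/* stands in for PQgetResult returning NULL */
} ResStatus;

/* Stub: hands out statuses from an array, then RES_DONE. */
static ResStatus
next_result(const ResStatus *stream, int n, int *pos)
{
	return (*pos < n) ? stream[(*pos)++] : RES_DONE;
}

/*
 * Drain everything the connection still has queued, but remember the
 * first error instead of silently discarding it; the caller can then
 * report it once the connection is back to an idle state.
 */
static ResStatus
drain_results(const ResStatus *stream, int n)
{
	ResStatus	first_error = RES_DONE;
	int			pos = 0;
	ResStatus	r;

	while ((r = next_result(stream, n, &pos)) != RES_DONE)
	{
		if (r == RES_FATAL_ERROR && first_error == RES_DONE)
			first_error = r;	/* real code would keep the PGresult */
	}
	return first_error;			/* RES_DONE here means "no error seen" */
}
```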

+ else if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ {
+ clear_conn_state(fsstate->conn_state);
+ pgfdw_report_error(res, conn, fsstate->query);
+ }

I also doubt that this is right. If it's necessary to call
clear_conn_state() before reporting this error, then the error
handling in this patch is probably wrong. We don't know where errors
will be thrown in general and cannot rely on being able to do manual
error cleanup before an error occurs. The goal should be for it to be
safe for pgfdw_report_error -- or just ereport(ERROR, ...) -- to
happen anywhere without causing problems in the future.

clear_conn_state() doesn't look right either. It does some cleanup on
each element of the prev_queries list and decrements num_prev_queries,
but it doesn't actually remove the queries from the list. So after
calling this function the first time, I think that num_prev_queries
might end up being 0 while prev_queries is still a list of the same
length as before. After that, I imagine calling clear_conn_state() a
second time would result in num_prev_queries going negative. I don't
really understand why num_prev_queries exists -- it seems like it's
just a recipe for mistakes to have both the list itself, and a
separate field that gives you the list length. If you need that, the
existing List data type will do that bookkeeping for you, and then
there's less opportunity for mistakes.
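
The underlying principle: derive the length from the list itself and there is no second copy of the truth to drift out of sync. A generic sketch (this is exactly what PostgreSQL's list_length() on a List gives you for free):

```c
#include <stddef.h>

typedef struct Item
{
	struct Item *next;
} Item;

/*
 * Counting by walking the list means a forgotten unlink can never leave
 * a stale counter behind -- the "counter" is always recomputed from the
 * one authoritative structure.
 */
static int
list_length_of(const Item *head)
{
	int			n = 0;

	for (const Item *it = head; it != NULL; it = it->next)
		n++;
	return n;
}
```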

Overall, I think the direction of the patch set has some promise, but
I think it needs a lot of cleanup: removal of unnecessary code, proper
formatting, moving variables to inner scopes, explanatory comments,
good names for variables and functions and structure members, removal
of unnecessary files from the patch, cleanup of the regression test
coverage so that it doesn't add more bloat than necessary, proper
choice of data structures, and so on. Right now, the good things that
you've done here are being hidden by these sorts of mechanical issues.
That's not just an issue for me as a reviewer: I suspect it's also
blocking you, as the patch author, from finding places where the code
could be made better. Being able to find such opportunities for
improvement and act on them is what will get this patch from
"interesting proof of concept" to "potentially committable patch".

--
Robert Haas
EDB: http://www.enterprisedb.com