Обсуждение: Asynchronous MergeAppend

Поиск
Список
Период
Сортировка

Asynchronous MergeAppend

От
Alexander Pyhalov
Дата:
Hello.

I'd like to make MergeAppend node Async-capable like Append node. 
Nowadays when planner chooses MergeAppend plan, asynchronous execution 
is not possible. With attached patches you can see plans like

EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM async_pt WHERE b % 100 = 0 ORDER BY b, a;
                                                           QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------
  Merge Append
    Sort Key: async_pt.b, async_pt.a
    ->  Async Foreign Scan on public.async_p1 async_pt_1
          Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
          Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE (((b % 
100) = 0)) ORDER BY b ASC NULLS LAST, a ASC NULLS LAST
    ->  Async Foreign Scan on public.async_p2 async_pt_2
          Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
          Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE (((b % 
100) = 0)) ORDER BY b ASC NULLS LAST, a ASC NULLS LAST

This can be quite profitable (in our test cases you can gain up to two 
times better speed with MergeAppend async execution on remote servers).

Code for asynchronous execution in Merge Append was mostly borrowed from 
Append node.

What significantly differs - in ExecMergeAppendAsyncGetNext() you must 
return tuple from the specified slot.
Subplan number determines tuple slot where data should be retrieved to. 
When subplan is ready to provide some data,
it's cached in ms_asyncresults. When we get tuple for subplan, specified 
in ExecMergeAppendAsyncGetNext(),
ExecMergeAppendAsyncRequest() returns true and loop in 
ExecMergeAppendAsyncGetNext() ends. We can fetch data for
subplans which either don't have cached result ready or have already 
returned them to the upper node. This
flag is stored in ms_has_asyncresults. As we can get data for some 
subplan either earlier or after loop in ExecMergeAppendAsyncRequest(),
we check this flag twice in this function.
Unlike ExecAppendAsyncEventWait(), it seems 
ExecMergeAppendAsyncEventWait() doesn't need a timeout - as there's no 
need to get result
from synchronous subplan if a tuple form async one was explicitly 
requested.

Also we had to fix postgres_fdw to avoid directly looking at Append 
fields. Perhaps, accesors to Append fields look strange, but allows
to avoid some code duplication. I suppose, duplication could be even 
less if we reworked async Append implementation, but so far I haven't
tried to do this to avoid big diff from master.

Also mark_async_capable() believes that path corresponds to plan. This 
can be not true when create_[merge_]append_plan() inserts sort node.
In this case mark_async_capable() can treat Sort plan node as some other 
and crash, so there's a small fix for this.
-- 
Best regards,
Alexander Pyhalov,
Postgres Professional
Вложения

Re: Asynchronous MergeAppend

От
Alena Rybakina
Дата:
Hi! Thank you for your work on this subject! I think this is a very 
useful optimization)

While looking through your code, I noticed some points that I think 
should be taken into account. Firstly, I noticed only two tests to 
verify the functionality of this function and I think that this is not 
enough.
Are you thinking about adding some tests with queries involving, for 
example, join connections with different tables and unusual operators?

In addition, I have a question about testing your feature on a 
benchmark. Are you going to do this?

On 17.07.2024 16:24, Alexander Pyhalov wrote:
> Hello.
>
> I'd like to make MergeAppend node Async-capable like Append node. 
> Nowadays when planner chooses MergeAppend plan, asynchronous execution 
> is not possible. With attached patches you can see plans like
>
> EXPLAIN (VERBOSE, COSTS OFF)
> SELECT * FROM async_pt WHERE b % 100 = 0 ORDER BY b, a;
>                                                           QUERY PLAN
>
------------------------------------------------------------------------------------------------------------------------------

>
>  Merge Append
>    Sort Key: async_pt.b, async_pt.a
>    ->  Async Foreign Scan on public.async_p1 async_pt_1
>          Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
>          Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE (((b % 
> 100) = 0)) ORDER BY b ASC NULLS LAST, a ASC NULLS LAST
>    ->  Async Foreign Scan on public.async_p2 async_pt_2
>          Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
>          Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE (((b % 
> 100) = 0)) ORDER BY b ASC NULLS LAST, a ASC NULLS LAST
>
> This can be quite profitable (in our test cases you can gain up to two 
> times better speed with MergeAppend async execution on remote servers).
>
> Code for asynchronous execution in Merge Append was mostly borrowed 
> from Append node.
>
> What significantly differs - in ExecMergeAppendAsyncGetNext() you must 
> return tuple from the specified slot.
> Subplan number determines tuple slot where data should be retrieved 
> to. When subplan is ready to provide some data,
> it's cached in ms_asyncresults. When we get tuple for subplan, 
> specified in ExecMergeAppendAsyncGetNext(),
> ExecMergeAppendAsyncRequest() returns true and loop in 
> ExecMergeAppendAsyncGetNext() ends. We can fetch data for
> subplans which either don't have cached result ready or have already 
> returned them to the upper node. This
> flag is stored in ms_has_asyncresults. As we can get data for some 
> subplan either earlier or after loop in ExecMergeAppendAsyncRequest(),
> we check this flag twice in this function.
> Unlike ExecAppendAsyncEventWait(), it seems 
> ExecMergeAppendAsyncEventWait() doesn't need a timeout - as there's no 
> need to get result
> from synchronous subplan if a tuple form async one was explicitly 
> requested.
>
> Also we had to fix postgres_fdw to avoid directly looking at Append 
> fields. Perhaps, accesors to Append fields look strange, but allows
> to avoid some code duplication. I suppose, duplication could be even 
> less if we reworked async Append implementation, but so far I haven't
> tried to do this to avoid big diff from master.
>
> Also mark_async_capable() believes that path corresponds to plan. This 
> can be not true when create_[merge_]append_plan() inserts sort node.
> In this case mark_async_capable() can treat Sort plan node as some 
> other and crash, so there's a small fix for this.

I think you should add this explanation to the commit message because 
without it it's hard to understand the full picture of how your code works.

-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Asynchronous MergeAppend

От
Alexander Pyhalov
Дата:
Hi.

I've updated patches for asynchronous merge append. They allowed us to 
significantly improve performance in practice. Earlier select from 
partitioned (and distributed table) could switch to synchronous merge 
append plan from asynchronous append. Given that table could have 20+ 
partitions, it was cheaper, but much less efficient due to remote parts 
executing synchronously.

In this version there's a couple of small fixes - earlier 
ExecMergeAppend() scanned all asyncplans, but should do this only for 
valid asyncplans. Also incorporated logic from

commit af717317a04f5217728ce296edf4a581eb7e6ea0
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   Wed Mar 12 20:53:09 2025 +0200

     Handle interrupts while waiting on Append's async subplans

into ExecMergeAppendAsyncEventWait().

-- 
Best regards,
Alexander Pyhalov,
Postgres Professional
Вложения

Re: Asynchronous MergeAppend

От
Álvaro Herrera
Дата:
I noticed that this patch has gone largely unreviewed, but it needs
rebase due to the GUC changes, so here it is again.

Thanks

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Вложения

Re: Asynchronous MergeAppend

От
"Matheus Alcantara"
Дата:
Hi, thanks for working on this!

On Tue Aug 20, 2024 at 6:14 AM -03, Alexander Pyhalov wrote:
>> In addition, I have a question about testing your feature on a
>> benchmark. Are you going to do this?
>>
>
> The main reason for this work is a dramatic performance degradation when
> Append plans with async foreign scan nodes are switched to MergeAppend
> plans with synchronous foreign scans.
>
> I've performed some synthetic tests to prove the benefits of async Merge
> Append. So far tests are performed on one physical host.
>
> For tests I've deployed 3 PostgreSQL instances on ports 5432-5434.
>
> The first instance:
> create server s2 foreign data wrapper postgres_fdw OPTIONS ( port
> '5433', dbname 'postgres', async_capable 'on');
> create server s3 foreign data wrapper postgres_fdw OPTIONS ( port
> '5434', dbname 'postgres', async_capable 'on');
>
> create foreign table players_p1 partition of players for values with
> (modulus 4, remainder 0) server s2;
> create foreign table players_p2 partition of players for values with
> (modulus 4, remainder 1) server s2;
> create foreign table players_p3 partition of players for values with
> (modulus 4, remainder 2) server s3;
> create foreign table players_p4 partition of players for values with
> (modulus 4, remainder 3) server s3;
>
> s2 instance:
> create table players_p1  (id int, name text, score int);
> create table players_p2  (id int, name text, score int);
> create index on players_p1(score);
> create index on players_p2(score);
>
> s3 instance:
> create table players_p3  (id int, name text, score int);
> create table players_p4  (id int, name text, score int);
> create index on players_p3(score);
> create index on players_p4(score);
>
> s1 instance:
> insert into players select i, 'player_' ||i, random()* 100 from
> generate_series(1,100000) i;
>
> pgbench script:
> \set rnd_offset random(0,200)
> \set rnd_limit  random(10,20)
>
> select * from players order by score desc offset :rnd_offset limit
> :rnd_limit;
>
> pgbench was run as:
> pgbench -n -f 1.sql  postgres -T 100 -c 16 -j 16
>
> CPU idle was about 5-10%.
>
> pgbench results:
>
> [...]
> However, if we set number of threads to 1, so that CPU has idle cores,
> we'll see more evident improvements:
>
> Patched, async_capable on:
> pgbench (14.13, server 18devel)
> transaction type: 1.sql
> scaling factor: 1
> query mode: simple
> number of clients: 1
> number of threads: 1
> duration: 100 s
> number of transactions actually processed: 20221
> latency average = 4.945 ms
> initial connection time = 7.035 ms
> tps = 202.221816 (without initial connection time)
>
>
> Patched, async_capable off
> transaction type: 1.sql
> scaling factor: 1
> query mode: simple
> number of clients: 1
> number of threads: 1
> duration: 100 s
> number of transactions actually processed: 14941
> latency average = 6.693 ms
> initial connection time = 7.037 ms
> tps = 149.415688 (without initial connection time)
>
I ran some benchmarks based on v4 attached by Alvaro in [1] using a
smaller number of threads so that some CPU cores would be idle and I
also obtained better results:

Patched, async_capable on:
tps = 4301.567405

Master, async_capable on:
tps = 3847.084545

So I'm +1 for the idea. I know it's been while since the last patch, and
unfortunully it hasn't received reviews since then. Do you still plan to
work on it? I still need to take a look on the code to see if I can help
with some comments.

During the tests I got compiler errors due to fce7c73fba4, so I'm
attaching a v5 with guc_parameters.dat correctly sorted.

The postgres_fdw/regress tests was also failling due to some whitespace
problems, v5 also fix this.

[1] https://www.postgresql.org/message-id/202510251154.isknefznk566%40alvherre.pgsql

--
Matheus Alcantara



Вложения

Re: Asynchronous MergeAppend

От
Alexander Pyhalov
Дата:
Hi.

Matheus Alcantara писал(а) 2025-11-03 16:00:
> So I'm +1 for the idea. I know it's been while since the last patch,
> and
> unfortunully it hasn't received reviews since then. Do you still plan
> to
> work on it? I still need to take a look on the code to see if I can
> help
> with some comments.
>
> During the tests I got compiler errors due to fce7c73fba4, so I'm
> attaching a v5 with guc_parameters.dat correctly sorted.
>
> The postgres_fdw/regress tests was also failling due to some whitespace
> problems, v5 also fix this.
>
> [1]
> https://www.postgresql.org/message-id/202510251154.isknefznk566%40alvherre.pgsql
>

I'm still interested in working on this patch, but it didn't get any
review (besides internal one). I suppose, Append and MergeAppend nodes
need some unification, for example, ExecAppendAsyncEventWait and
ExecMergeAppendAsyncEventWait looks the same, both
classify_matching_subplans() versions are suspiciously similar. But
honestly, patch needs thorough review.
--
Best regards,
Alexander Pyhalov,
Postgres Professional



Re: Asynchronous MergeAppend

От
"Matheus Alcantara"
Дата:
On Wed Nov 5, 2025 at 3:30 AM -03, Alexander Pyhalov wrote:
>> So I'm +1 for the idea. I know it's been while since the last patch,
>> and unfortunully it hasn't received reviews since then. Do you still
>> plan to work on it? I still need to take a look on the code to see if
>> I can help with some comments.
>
> I'm still interested in working on this patch, but it didn't get any
> review (besides internal one). I suppose, Append and MergeAppend nodes
> need some unification, for example, ExecAppendAsyncEventWait and
> ExecMergeAppendAsyncEventWait looks the same, both
> classify_matching_subplans() versions are suspiciously similar. But
> honestly, patch needs thorough review.
>
Here are some comments on my first look at the patches. I still don't
have too much experience with the executor code but I hope that I can
help with something.

v5-0001-mark_async_capable-subpath-should-match-subplan.patch

I don't have to much comments on this, perhaps we could have a commit
message explaining the reason behind the change.

----

v5-0002-MergeAppend-should-support-Async-Foreign-Scan-sub.patch

The AppendState struct has the "as_syncdone", this field is not needed
on MergeAppendState?

----
Regarding the duplicated code on classify_matching_subplans I think that
we can have a more generic function that operates over function
parameters, something like this:

/*
 * Classify valid subplans into sync and async groups.
 *
 * It calculates the intersection of *valid_subplans and *asyncplans,
 * stores the result in *valid_asyncplans, and removes those members
 * from *valid_subplans (leaving only sync plans).
 *
 * Returns true if valid async plans were found, false otherwise.
 */
static bool
classify_subplans_internal(Bitmapset **valid_subplans,
                       Bitmapset *asyncplans,
                       Bitmapset **valid_asyncplans);

----
The GetValidAsyncplans() is not being used

----
We have some reduction of code coverage on nodeMergeAppend.c. The
significant blocks are on ExecMergeAppendAsyncBegin():
+    /* If we've yet to determine the valid subplans then do so now. */
+    if (!node->ms_valid_subplans_identified)
+    {
+        node->ms_valid_subplans =
+            ExecFindMatchingSubPlans(node->ms_prune_state, false, NULL);
+        node->ms_valid_subplans_identified = true;
+
+        classify_matching_subplans(node);
+    }

And there are some blocks on ExecReScanMergeAppend(). It's worth adding
a test case for then? I'm not sure how hard would be to write a
regression test that cover these blocks.

----
I agree that duplicated code is not good but it seems to me that we
already have some code on nodeMergeAppend.c borrowed from nodeAppend.c
even without you patch, for example the ExecInitMergeAppend(),
ExecReScanMergeAppend() and partially ExecMergeAppend().

Although nodeMergeAppend.c and nodeAppend.c have similar functions ,
some difference exists and I'm wondering if we should wait for the rule
of three [1] to refactor these duplicated code?

[1] https://en.wikipedia.org/wiki/Rule_of_three_(computer_programming)

--
Matheus Alcantara
EDB: http://www.enterprisedb.com




Re: Asynchronous MergeAppend

От
Alexander Pyhalov
Дата:
Matheus Alcantara писал(а) 2025-11-12 00:00:
> On Wed Nov 5, 2025 at 3:30 AM -03, Alexander Pyhalov wrote:
>>> So I'm +1 for the idea. I know it's been while since the last patch,
>>> and unfortunully it hasn't received reviews since then. Do you still
>>> plan to work on it? I still need to take a look on the code to see if
>>> I can help with some comments.
>> 
>> I'm still interested in working on this patch, but it didn't get any
>> review (besides internal one). I suppose, Append and MergeAppend nodes
>> need some unification, for example, ExecAppendAsyncEventWait and
>> ExecMergeAppendAsyncEventWait looks the same, both
>> classify_matching_subplans() versions are suspiciously similar. But
>> honestly, patch needs thorough review.
>> 

Hi.
Thanks for review.

> Here are some comments on my first look at the patches. I still don't
> have too much experience with the executor code but I hope that I can
> help with something.
> 
> v5-0001-mark_async_capable-subpath-should-match-subplan.patch
> 
> I don't have to much comments on this, perhaps we could have a commit
> message explaining the reason behind the change.

I've expanded commit message. The issue is that mark_async_capable() 
relies
on the fact that plan node type corresponds to path type. This is not 
true when
(Merge)Append decides to insert Sort node.

> 
> ----
> 
> v5-0002-MergeAppend-should-support-Async-Foreign-Scan-sub.patch
> 
> The AppendState struct has the "as_syncdone", this field is not needed
> on MergeAppendState?

We don't need as_syncdone. Async Append fetches tuple from async subplan 
and waits for them either when they have some data or when there's no 
more sync subplans (as we can return any tuple we receive from 
subplans). But ExecMergeAppend should decide which tuple to return based 
on sort order, so there's no need to remember if we are done with sync 
subplans, as subplans ordering matters, and we can't arbitrary switch 
between them.


> Regarding the duplicated code on classify_matching_subplans I think 
> that
> we can have a more generic function that operates over function
> parameters

I've tried to do so, but there are two issues. There's no suitable 
common header between
nodAppend and nodeMergeAppend. I've put 
classify_matching_subplans_common() into src/include/nodes/execnodes.h
and sure that it's not the best choice. The second issue is with 
as_syncdone, it exists only in AppendState, so
we should check for empty valid_subplans separately. In fact, there's 3 
outcomes for Append 1) no sync plans,
no async plans,  2) no async plans, 3) async plans present, and only 
last two states have meaning
for MergeAppend.

> The GetValidAsyncplans() is not being used

As well as GetValidAsyncRequest(), these parts are used by our FDW, so 
they slipped in the patch. Removed them.

> 
> ----
> We have some reduction of code coverage on nodeMergeAppend.c. The
> significant blocks are on ExecMergeAppendAsyncBegin():
> +    /* If we've yet to determine the valid subplans then do so now. */
> +    if (!node->ms_valid_subplans_identified)
> +    {
> +        node->ms_valid_subplans =
> +            ExecFindMatchingSubPlans(node->ms_prune_state, false, NULL);
> +        node->ms_valid_subplans_identified = true;
> +
> +        classify_matching_subplans(node);
> +    }
> 
> And there are some blocks on ExecReScanMergeAppend(). It's worth adding
> a test case for then? I'm not sure how hard would be to write a
> regression test that cover these blocks.
> 

You are right. There's difference between ExecAppend and 
ExecMergeAppend. Append identifies valid subplans in 
ExecAppendAsyncBegin. MergeAppend - earlier, in ExecMergeAppend(). So 
this is really the dead code. And there was an issue with it, which 
became evident when I've added test for rescan. When we've identified 
new subplans in ExecMergeAppend(), we have to classify them.
-- 
Best regards,
Alexander Pyhalov,
Postgres Professional
Вложения

Re: Asynchronous MergeAppend

От
Matheus Alcantara
Дата:
On Sat Nov 15, 2025 at 7:57 AM -03, Alexander Pyhalov wrote:
>> Here are some comments on my first look at the patches. I still don't
>> have too much experience with the executor code but I hope that I can
>> help with something.
>>
>> v5-0001-mark_async_capable-subpath-should-match-subplan.patch
>>
>> I don't have to much comments on this, perhaps we could have a commit
>> message explaining the reason behind the change.
>
> I've expanded commit message. The issue is that mark_async_capable()
> relies
> on the fact that plan node type corresponds to path type. This is not
> true when
> (Merge)Append decides to insert Sort node.
>
Your explanation about why this change is needed that you've include on
your first email sounds more clear IMHO. I would suggest the following
for a commit message:
    mark_async_capable() believes that path corresponds to plan. This is
    not true when create_[merge_]append_plan() inserts sort node. In
    this case mark_async_capable() can treat Sort plan node as some
    other and crash. Fix this by handling the Sort node separately.

    This is needed to make MergeAppend node async-capable that it will
    be implemented in a next commit.

What do you think?

I was reading the patch changes again and I have a minor point:

                                SubqueryScan *scan_plan = (SubqueryScan *) plan;

                                /*
-                                * If the generated plan node includes
a gating Result node,
-                                * we can't execute it asynchronously.
+                                * If the generated plan node includes
a gating Result node or
+                                * a Sort node, we can't execute it
asynchronously.
                                 */
-                               if (IsA(plan, Result))
+                               if (IsA(plan, Result) || IsA(plan, Sort))

Shouldn't we cast the plan to SubqueryScan* after the IsA(...) check? I
know this code has been before your changes but type casting before a
IsA() check sounds a bit strange to me. Also perhaps we could add an
Assert(IsA(plan, SubqueryScan)) after the IsA(...) check and before the
type casting just for sanity?

>> ----
>>
>> v5-0002-MergeAppend-should-support-Async-Foreign-Scan-sub.patch
>>
>> The AppendState struct has the "as_syncdone", this field is not needed
>> on MergeAppendState?
>
> We don't need as_syncdone. Async Append fetches tuple from async subplan
> and waits for them either when they have some data or when there's no
> more sync subplans (as we can return any tuple we receive from
> subplans). But ExecMergeAppend should decide which tuple to return based
> on sort order, so there's no need to remember if we are done with sync
> subplans, as subplans ordering matters, and we can't arbitrary switch
> between them.
>
Ok, thanks for the explanation.

>
>> Regarding the duplicated code on classify_matching_subplans I think
>> that
>> we can have a more generic function that operates over function
>> parameters
>
> I've tried to do so, but there are two issues. There's no suitable
> common header between
> nodAppend and nodeMergeAppend. I've put
> classify_matching_subplans_common() into src/include/nodes/execnodes.h
> and sure that it's not the best choice. The second issue is with
> as_syncdone, it exists only in AppendState, so
> we should check for empty valid_subplans separately. In fact, there's 3
> outcomes for Append 1) no sync plans,
> no async plans,  2) no async plans, 3) async plans present, and only
> last two states have meaning
> for MergeAppend.
>
I think that's ok to have these separated checks on nodeAppend.c and
nodeMergeAppend.c once the majority of duplicated steps that would be
required is centralized into a single reusable function.

I also agree that execnodes.h may not be the best place to declare this
function but I also don't have too many ideas of where to put it. Let's
see if we have more comments on this.

>> ----
>> We have some reduction of code coverage on nodeMergeAppend.c. The
>> significant blocks are on ExecMergeAppendAsyncBegin():
>> +    /* If we've yet to determine the valid subplans then do so now. */
>> +    if (!node->ms_valid_subplans_identified)
>> +    {
>> +            node->ms_valid_subplans =
>> +                    ExecFindMatchingSubPlans(node->ms_prune_state, false, NULL);
>> +            node->ms_valid_subplans_identified = true;
>> +
>> +            classify_matching_subplans(node);
>> +    }
>>
>> And there are some blocks on ExecReScanMergeAppend(). It's worth adding
>> a test case for then? I'm not sure how hard would be to write a
>> regression test that cover these blocks.
>>
>
> You are right. There's difference between ExecAppend and
> ExecMergeAppend. Append identifies valid subplans in
> ExecAppendAsyncBegin. MergeAppend - earlier, in ExecMergeAppend(). So
> this is really the dead code. And there was an issue with it, which
> became evident when I've added test for rescan. When we've identified
> new subplans in ExecMergeAppend(), we have to classify them.
>
Thanks, the code coverage looks better now.

I plan to do another round of review on 0002, in the meantime I'm
sharing these comments for now.

--
Matheus Alcantara
EDB: http://www.enterprisedb.com



Re: Asynchronous MergeAppend

От
Alexander Pyhalov
Дата:
Hi.

Matheus Alcantara писал(а) 2025-11-18 00:09:
> On Sat Nov 15, 2025 at 7:57 AM -03, Alexander Pyhalov wrote:
>>> Here are some comments on my first look at the patches. I still don't
>>> have too much experience with the executor code but I hope that I can
>>> help with something.
>>>
>>> v5-0001-mark_async_capable-subpath-should-match-subplan.patch
>>>
>>> I don't have to much comments on this, perhaps we could have a commit
>>> message explaining the reason behind the change.
>>
>> I've expanded commit message. The issue is that mark_async_capable()
>> relies
>> on the fact that plan node type corresponds to path type. This is not
>> true when
>> (Merge)Append decides to insert Sort node.
>>
> Your explanation about why this change is needed that you've include on
> your first email sounds more clear IMHO. I would suggest the following
> for a commit message:
>     mark_async_capable() believes that path corresponds to plan. This
> is
>     not true when create_[merge_]append_plan() inserts sort node. In
>     this case mark_async_capable() can treat Sort plan node as some
>     other and crash. Fix this by handling the Sort node separately.
>
>     This is needed to make MergeAppend node async-capable that it will
>     be implemented in a next commit.
>
> What do you think?
>

Seems to be OK.

> I was reading the patch changes again and I have a minor point:
>
>                                 SubqueryScan *scan_plan = (SubqueryScan
> *) plan;
>
>                                 /*
> -                                * If the generated plan node includes
> a gating Result node,
> -                                * we can't execute it asynchronously.
> +                                * If the generated plan node includes
> a gating Result node or
> +                                * a Sort node, we can't execute it
> asynchronously.
>                                  */
> -                               if (IsA(plan, Result))
> +                               if (IsA(plan, Result) || IsA(plan,
> Sort))
>
> Shouldn't we cast the plan to SubqueryScan* after the IsA(...) check? I
> know this code has been before your changes but type casting before a
> IsA() check sounds a bit strange to me. Also perhaps we could add an
> Assert(IsA(plan, SubqueryScan)) after the IsA(...) check and before the
> type casting just for sanity?

Yes, checking for node not to be A and then using it as B seems to be
strange. But casting to another type and checking if node is of a
particular type before using seems to be rather common. It doesn't do
any harm if we don't actually refer to SubqueryScan fields.

Updated the first patch.


--
Best regards,
Alexander Pyhalov,
Postgres Professional
Вложения

Re: Asynchronous MergeAppend

От
"Matheus Alcantara"
Дата:
On Tue Nov 18, 2025 at 4:14 AM -03, Alexander Pyhalov wrote:
> Updated the first patch.
>
Thanks for the new version. Some new comments.

v7-0002-MergeAppend-should-support-Async-Foreign-Scan-subpla.patch:

1. Should be "nasyncplans" instead of "nplans"? ExecInitAppend use
"nasyncplans" to allocate the as_asyncresults array.

+        mergestate->ms_asyncresults = (TupleTableSlot **)
+            palloc0(nplans * sizeof(TupleTableSlot *));
+

2. I think that this comment should be updated. IIUC ms_valid_subplans
may not be all subplans because classify_matching_subplans() may move
async ones to ms_valid_asyncplans. Is that right?

/*
 * If we've yet to determine the valid subplans then do so now.  If
 * run-time pruning is disabled then the valid subplans will always be
 * set to all subplans.
 */

3. This code comment should also mention the Assert(!bms_is_member(...));?

+    /* The result should be a TupleTableSlot or NULL. */
+    Assert(slot == NULL || IsA(slot, TupleTableSlot));
+    Assert(!bms_is_member(areq->request_index, node->ms_has_asyncresults));

4. It's worth include bms_num_members(node->ms_needrequest) <= 0 check
on ExecMergeAppendAsyncRequest() as an early return? IIUC it would avoid
the bms_is_member(), bms_next_member() and bms_is_empty(needrequest)
calls.

ExecMergeAppendAsyncRequest(MergeAppendState *node, int mplan)
        Bitmapset  *needrequest;
        int                     i;

+       if (bms_num_members(node->ms_needrequest) <= 0)
+               return false;
+

I'm attaching a diff with some cosmetic changes of indentation and
comments. Feel free to include on the patch or not.

--
Matheus Alcantara
EDB: http://www.enterprisedb.com


Вложения

Re: Asynchronous MergeAppend

От
Alexander Pyhalov
Дата:
Matheus Alcantara писал(а) 2025-11-20 00:51:
> On Tue Nov 18, 2025 at 4:14 AM -03, Alexander Pyhalov wrote:
>> Updated the first patch.
>>
> Thanks for the new version. Some new comments.
>
> v7-0002-MergeAppend-should-support-Async-Foreign-Scan-subpla.patch:
>
> 1. Should be "nasyncplans" instead of "nplans"? ExecInitAppend use
> "nasyncplans" to allocate the as_asyncresults array.
>
> +        mergestate->ms_asyncresults = (TupleTableSlot **)
> +            palloc0(nplans * sizeof(TupleTableSlot *));
> +
>

No. There's a difference between how Append and MergeAppend handle async
results.

When Append looks for the next result, it can return any of them.
So, async results are not ordered in Append.
We have maximum nasyncplans of async results and return the first
available result
when we asked for one . For example, in ExecAppendAsyncRequest():

1004         /*
1005          * If there are any asynchronously-generated results that
have not yet
1006          * been returned, we have nothing to do; just return one of
them.
1007          */
1008         if (node->as_nasyncresults > 0)
1009         {
1010                 --node->as_nasyncresults;
1011                 *result =
node->as_asyncresults[node->as_nasyncresults];
1012                 return true;
1013         }

ExecAppendAsyncGetNext() looks (via ExecAppendAsyncRequest()) on any
result.

However, when we are asked for result in MergeAppend, we should return
result of
the specific subplan. To achieve this we should know, which subplan
given results correspond to.
So, we enumerate async results in the same way as requests (or
ms_valid_asyncplans).
Look at ExecAppendAsyncGetNext()/ExecAppendAsyncRequest().

> 2. I think that this comment should be updated. IIUC ms_valid_subplans
> may not be all subplans because classify_matching_subplans() may move
> async ones to ms_valid_asyncplans. Is that right?
>
> /*
>  * If we've yet to determine the valid subplans then do so now.  If
>  * run-time pruning is disabled then the valid subplans will always be
>  * set to all subplans.
>  */
>

Yes, you are correct, and similar comment in nodeAppend.c lacks the last
sentence.
Removed it.

> 3. This code comment should also mention the
> Assert(!bms_is_member(...));?
>
> +    /* The result should be a TupleTableSlot or NULL. */
> +    Assert(slot == NULL || IsA(slot, TupleTableSlot));
> +    Assert(!bms_is_member(areq->request_index,
> node->ms_has_asyncresults));
>

> 4. It's worth include bms_num_members(node->ms_needrequest) <= 0 check
> on ExecMergeAppendAsyncRequest() as an early return? IIUC it would
> avoid
> the bms_is_member(), bms_next_member() and bms_is_empty(needrequest)
> calls.

We can't exclude the first bms_is_member(), as node->ms_needrequest can
be empty
(we've already got result), so do not need to do request to get it, just
return previously fetched result.

Not sure about check above the following lines:

650         i = -1;
651         while ((i = bms_next_member(node->ms_needrequest, i)) >= 0)
652         {
653                 if (!bms_is_member(i, node->ms_has_asyncresults))
654                         needrequest = bms_add_member(needrequest,
i);
655         }
656

I think, it shouldn't be much cheaper as bms_next_member() will execute
a couple instructions
to find out that the number of words in bitmapset is zero, but will do
nothing expensive.

>
> ExecMergeAppendAsyncRequest(MergeAppendState *node, int mplan)
>         Bitmapset  *needrequest;
>         int                     i;
>
> +       if (bms_num_members(node->ms_needrequest) <= 0)
> +               return false;
> +
>

No, as I've mentioned, we can't exclude bms_is_member(mplan,
node->ms_has_asyncresults) check.
We could have received result while waiting for data for another
subplan.

Let's assume, we have 2 async subplans (0 and 1). For example, we've
decided
to get data from subplan 1. We 've already send requests to both async
subplans (in ExecMergeAppendAsyncBegin() or later).
Now we do ExecMergeAppendAsyncRequest(), but there's no subplans, which
need request.
So we enter the waiting loop. Let's assume we got event for another
subplan (0). Via ExecAsyncNotify(),
ExecAsyncForeignScanNotify() we go to postgresForeignAsyncNotify(),
fetch data and mark request as complete.
Via ExecAsyncMergeAppendResponse() we save results for subplan 0 in
node->ms_asyncresults[0]. When we finally
got result for subplan 1, we do the same, but now exit the loop. When
MergeAppend finally decides that it needs
results from subplan 0, we already have them, but ms_needrequest is
empty, so  ExecMergeAppendAsyncRequest()
just returns this pre-fetched tuple.
--
Best regards,
Alexander Pyhalov,
Postgres Professional
Вложения

Re: Asynchronous MergeAppend

От
"Matheus Alcantara"
Дата:
Thanks for the new version. I don't have other comments on the current
state of the patch. It seems to working as expected and we have
performance improvements that I think that it make it worthwhile.

I have just a small comment on 0002:

+    noccurred = WaitEventSetWait(node->ms_eventset, -1 /* no timeout */ , occurred_event,
+                                 nevents, WAIT_EVENT_APPEND_READY);

Should we use the same WAIT_EVENT_APPEND_READY or create a new wait
event for merge append?

I've spent some time thinking about how we could remove some parts of
the duplicated code that you've previously mention. I think that we
could try to create something like we already have for relation scan
operations, that we have execScan.c that is used for example by
nodeSeqScan.c and nodeIndexScan.c. The attached patch 0003 is a draft
that I attempt to implement this idea. The 0001 and 0002 remains the
same as the previous version. The 0003 was build on top of these.

I've created Appender and AppenderState types that are used by
Append/MergeAppend and AppendState/MergeAppendState respectively (I
should have think in a better name for these base type, ideas are
welcome). The execAppend.c was created to have the functions that can be
reused by Append and MergeAppend execution. I've tried to remove
duplicated code blocks that was almost the same and that didn't require
much refactoring.

I think that there still more opportunities to remove similar code
blocks that for example are on ExecMergeAppendAsyncGetNext and
ExecAppendAsyncGetNext but it require refactoring.

Thoughts?

--
Matheus Alcantara
EDB: http://www.enterprisedb.com


Вложения

Re: Asynchronous MergeAppend

От
Alexander Pyhalov
Дата:
Hi.

Matheus Alcantara писал(а) 2025-12-17 23:01:
> Thanks for the new version. I don't have other comments on the current
> state of the patch. It seems to working as expected and we have
> performance improvements that I think that it make it worthwhile.
> 
> I have just a small comment on 0002:
> 
> +    noccurred = WaitEventSetWait(node->ms_eventset, -1 /* no timeout */ , 
> occurred_event,
> +                                 nevents, WAIT_EVENT_APPEND_READY);
> 
> Should we use the same WAIT_EVENT_APPEND_READY or create a new wait
> event for merge append?

I'm not sure that new wait event is needed - for observability I think 
it's not critical
to distinguish Append and MergeAppend when they waited for foreign 
scans.  But also it's perhaps
doesn't do any harm to record specific wait event.

> 
> I've spent some time thinking about how we could remove some parts of
> the duplicated code that you've previously mention. I think that we
> could try to create something like we already have for relation scan
> operations, that we have execScan.c that is used for example by
> nodeSeqScan.c and nodeIndexScan.c. The attached patch 0003 is a draft
> that I attempt to implement this idea. The 0001 and 0002 remains the
> same as the previous version. The 0003 was build on top of these.
> 
> I've created Appender and AppenderState types that are used by
> Append/MergeAppend and AppendState/MergeAppendState respectively (I
> should have think in a better name for these base type, ideas are
> welcome). The execAppend.c was created to have the functions that can 
> be
> reused by Append and MergeAppend execution. I've tried to remove
> duplicated code blocks that was almost the same and that didn't require
> much refactoring.

Overall I like new Appender node. Splitting code in this way really 
helps to avoid code duplication.
However, some similar code is still needed, because logic of getting new 
tuples is different.

Some minor issues I've noticed.
1) ExecReScanAppender()  sets node->needrequest to NULL. 
ExecReScanAppend() calls bms_free(node->as.needrequest) immediately 
after this. The same is true for ExecReScanMergeAppend(). We should move 
it to ExecReScanAppender().

2) In src/backend/executor/execAppend.c:
planstates are named  as mergeplans in ExecEndAppender(), perhaps, 
appendplans or subplans are better names.

ExecInitAppender() could use palloc_array() to allocate appendplanstates 
- as ExecInitMergeAppend().


> 
> I think that there still more opportunities to remove similar code
> blocks that for example are on ExecMergeAppendAsyncGetNext and
> ExecAppendAsyncGetNext but it require refactoring.
> 
> Thoughts?
> 
> --
> Matheus Alcantara
> EDB: http://www.enterprisedb.com

-- 
Best regards,
Alexander Pyhalov,
Postgres Professional



Re: Asynchronous MergeAppend

От
"Matheus Alcantara"
Дата:
On Thu Dec 18, 2025 at 6:56 AM -03, Alexander Pyhalov wrote:
>> +    noccurred = WaitEventSetWait(node->ms_eventset, -1 /* no timeout */ ,
>> occurred_event,
>> +                                 nevents, WAIT_EVENT_APPEND_READY);
>>
>> Should we use the same WAIT_EVENT_APPEND_READY or create a new wait
>> event for merge append?
>
> I'm not sure that new wait event is needed - for observability I think
> it's not critical
> to distinguish Append and MergeAppend when they waited for foreign
> scans.  But also it's perhaps
> doesn't do any harm to record specific wait event.
>
Ok, I think that we can keep this way for now and let's see if a new
wait event is really needed.

>> I've created Appender and AppenderState types that are used by
>> Append/MergeAppend and AppendState/MergeAppendState respectively (I
>> should have think in a better name for these base type, ideas are
>> welcome). The execAppend.c was created to have the functions that can
>> be
>> reused by Append and MergeAppend execution. I've tried to remove
>> duplicated code blocks that was almost the same and that didn't require
>> much refactoring.
>
> Overall I like new Appender node. Splitting code in this way really
> helps to avoid code duplication.
> However, some similar code is still needed, because logic of getting new
> tuples is different.
>
Indeed.

> Some minor issues I've noticed.
> 1) ExecReScanAppender()  sets node->needrequest to NULL.
> ExecReScanAppend() calls bms_free(node->as.needrequest) immediately
> after this. The same is true for ExecReScanMergeAppend(). We should move
> it to ExecReScanAppender().
>
Fixed

> 2) In src/backend/executor/execAppend.c:
> planstates are named  as mergeplans in ExecEndAppender(), perhaps,
> appendplans or subplans are better names.
>
Fixed

> ExecInitAppender() could use palloc_array() to allocate appendplanstates
> - as ExecInitMergeAppend().
>
Fixed

--
Matheus Alcantara
EDB: http://www.enterprisedb.com


Вложения

Re: Asynchronous MergeAppend

От
Alexander Pyhalov
Дата:
Matheus Alcantara писал(а) 2025-12-19 16:45:
> On Thu Dec 18, 2025 at 6:56 AM -03, Alexander Pyhalov wrote:
>>> +    noccurred = WaitEventSetWait(node->ms_eventset, -1 /* no timeout */ 
>>> ,
>>> occurred_event,
>>> +                                 nevents, WAIT_EVENT_APPEND_READY);
>>> 
>>> Should we use the same WAIT_EVENT_APPEND_READY or create a new wait
>>> event for merge append?
>> 
>> I'm not sure that new wait event is needed - for observability I think
>> it's not critical
>> to distinguish Append and MergeAppend when they waited for foreign
>> scans.  But also it's perhaps
>> doesn't do any harm to record specific wait event.
>> 
> Ok, I think that we can keep this way for now and let's see if a new
> wait event is really needed.
> 
>>> I've created Appender and AppenderState types that are used by
>>> Append/MergeAppend and AppendState/MergeAppendState respectively (I
>>> should have think in a better name for these base type, ideas are
>>> welcome). The execAppend.c was created to have the functions that can
>>> be
>>> reused by Append and MergeAppend execution. I've tried to remove
>>> duplicated code blocks that was almost the same and that didn't 
>>> require
>>> much refactoring.
>> 
>> Overall I like new Appender node. Splitting code in this way really
>> helps to avoid code duplication.
>> However, some similar code is still needed, because logic of getting 
>> new
>> tuples is different.
>> 

Hi.

I've looked through updated patch. Tested it (also with our fdw). 
Overall looks good.

In execAppend.c there's still reference to as_valid_subplans. Also we 
could perhaps use palloc0_array() in some more places, for example, for 
for state->asyncrequests and state->asyncresults.
-- 
Best regards,
Alexander Pyhalov,
Postgres Professional



Re: Asynchronous MergeAppend

От
"Matheus Alcantara"
Дата:
On Tue Dec 23, 2025 at 5:50 AM -03, Alexander Pyhalov wrote:
> I've looked through updated patch. Tested it (also with our fdw).
> Overall looks good.
>
Thanks for testing.

> In execAppend.c there's still reference to as_valid_subplans. Also we
> could perhaps use palloc0_array() in some more places, for example, for
> for state->asyncrequests and state->asyncresults.
>
Fixed on the new attached version.

--
Matheus Alcantara
EDB: https://www.enterprisedb.com


Вложения

Re: Asynchronous MergeAppend

От
Alexander Pyhalov
Дата:
Matheus Alcantara писал(а) 2025-12-29 16:43:
> On Tue Dec 23, 2025 at 5:50 AM -03, Alexander Pyhalov wrote:
>> I've looked through updated patch. Tested it (also with our fdw).
>> Overall looks good.
>> 
> Thanks for testing.
> 
>> In execAppend.c there's still reference to as_valid_subplans. Also we
>> could perhaps use palloc0_array() in some more places, for example, 
>> for
>> for state->asyncrequests and state->asyncresults.
>> 
> Fixed on the new attached version.
> 
> --
> Matheus Alcantara
> EDB: https://www.enterprisedb.com

Hi.

Looks good. What do you think about classify_matching_subplans_common()? 
Should it stay where it is or should we hide it to
src/include/executor/execAppend.h ?
-- 
Best regards,
Alexander Pyhalov,
Postgres Professional



Re: Asynchronous MergeAppend

От
"Matheus Alcantara"
Дата:
On Tue Dec 30, 2025 at 10:15 AM -03, Alexander Pyhalov wrote:
> Looks good. What do you think about classify_matching_subplans_common()?
> Should it stay where it is or should we hide it to
> src/include/executor/execAppend.h ?
>
Yeah, sounds better to me to move classify_matching_subplans_common to
execAppend.h since it very specific for the MergeAppend and Append
execution. I've declared the function as static inline on execAppend.h
but I'm not sure if it's the best approach.

I've also wrote a proper commit message for this new version.

--
Matheus Alcantara
EDB: https://www.enterprisedb.com


Вложения

Re: Asynchronous MergeAppend

От
Alexander Pyhalov
Дата:
Matheus Alcantara писал(а) 2025-12-30 18:04:
> On Tue Dec 30, 2025 at 10:15 AM -03, Alexander Pyhalov wrote:
>> Looks good. What do you think about 
>> classify_matching_subplans_common()?
>> Should it stay where it is or should we hide it to
>> src/include/executor/execAppend.h ?
>> 
> Yeah, sounds better to me to move classify_matching_subplans_common to
> execAppend.h since it very specific for the MergeAppend and Append
> execution. I've declared the function as static inline on execAppend.h
> but I'm not sure if it's the best approach.
> 
> I've also wrote a proper commit message for this new version.

Looks good to me.

-- 
Best regards,
Alexander Pyhalov,
Postgres Professional



Re: Asynchronous MergeAppend

От
Alexander Pyhalov
Дата:
Alexander Pyhalov писал(а) 2025-12-30 22:04:
> Matheus Alcantara писал(а) 2025-12-30 18:04:
>> On Tue Dec 30, 2025 at 10:15 AM -03, Alexander Pyhalov wrote:
>>> Looks good. What do you think about 
>>> classify_matching_subplans_common()?
>>> Should it stay where it is or should we hide it to
>>> src/include/executor/execAppend.h ?
>>> 
>> Yeah, sounds better to me to move classify_matching_subplans_common to
>> execAppend.h since it very specific for the MergeAppend and Append
>> execution. I've declared the function as static inline on execAppend.h
>> but I'm not sure if it's the best approach.
>> 
>> I've also wrote a proper commit message for this new version.
> 
> Looks good to me.

Hi.
Rebased patches over fresh master.

-- 
Best regards,
Alexander Pyhalov,
Postgres Professional
Вложения

Re: Asynchronous MergeAppend

От
Alexander Pyhalov
Дата:
Alexander Pyhalov писал(а) 2026-02-12 10:08:
> Alexander Pyhalov писал(а) 2025-12-30 22:04:
>> Matheus Alcantara писал(а) 2025-12-30 18:04:
>>> On Tue Dec 30, 2025 at 10:15 AM -03, Alexander Pyhalov wrote:
>>>> Looks good. What do you think about 
>>>> classify_matching_subplans_common()?
>>>> Should it stay where it is or should we hide it to
>>>> src/include/executor/execAppend.h ?
>>>> 
>>> Yeah, sounds better to me to move classify_matching_subplans_common 
>>> to
>>> execAppend.h since it very specific for the MergeAppend and Append
>>> execution. I've declared the function as static inline on 
>>> execAppend.h
>>> but I'm not sure if it's the best approach.
>>> 
>>> I've also wrote a proper commit message for this new version.
>> 
>> Looks good to me.
> 
> Hi.
> Rebased patches over fresh master.

Hi.
Rebased patches over fresh master.


-- 
Best regards,
Alexander Pyhalov,
Postgres Professional
Вложения

Re: Asynchronous MergeAppend

От
Alexander Pyhalov
Дата:
Alexander Pyhalov писал(а) 2026-03-02 09:44:
> Alexander Pyhalov писал(а) 2026-02-12 10:08:
>> Alexander Pyhalov писал(а) 2025-12-30 22:04:
>>> Matheus Alcantara писал(а) 2025-12-30 18:04:
>>>> On Tue Dec 30, 2025 at 10:15 AM -03, Alexander Pyhalov wrote:
>>>>> Looks good. What do you think about 
>>>>> classify_matching_subplans_common()?
>>>>> Should it stay where it is or should we hide it to
>>>>> src/include/executor/execAppend.h ?
>>>>> 
>>>> Yeah, sounds better to me to move classify_matching_subplans_common 
>>>> to
>>>> execAppend.h since it very specific for the MergeAppend and Append
>>>> execution. I've declared the function as static inline on 
>>>> execAppend.h
>>>> but I'm not sure if it's the best approach.
>>>> 
>>>> I've also wrote a proper commit message for this new version.
>>> 
>>> Looks good to me.
>> 
>> Hi.
>> Rebased patches over fresh master.
> 
> Hi.
> Rebased patches over fresh master.

Hi.
Rebased patches over fresh master.

-- 
Best regards,
Alexander Pyhalov,
Postgres Professional
Вложения

Re: Asynchronous MergeAppend

От
Alexander Korotkov
Дата:
Hi!

On Wed, Mar 18, 2026 at 8:17 AM Alexander Pyhalov
<a.pyhalov@postgrespro.ru> wrote:
> Alexander Pyhalov писал(а) 2026-03-02 09:44:
> > Alexander Pyhalov писал(а) 2026-02-12 10:08:
> >> Alexander Pyhalov писал(а) 2025-12-30 22:04:
> >>> Matheus Alcantara писал(а) 2025-12-30 18:04:
> >>>> On Tue Dec 30, 2025 at 10:15 AM -03, Alexander Pyhalov wrote:
> >>>>> Looks good. What do you think about
> >>>>> classify_matching_subplans_common()?
> >>>>> Should it stay where it is or should we hide it to
> >>>>> src/include/executor/execAppend.h ?
> >>>>>
> >>>> Yeah, sounds better to me to move classify_matching_subplans_common
> >>>> to
> >>>> execAppend.h since it very specific for the MergeAppend and Append
> >>>> execution. I've declared the function as static inline on
> >>>> execAppend.h
> >>>> but I'm not sure if it's the best approach.
> >>>>
> >>>> I've also wrote a proper commit message for this new version.
> >>>
> >>> Looks good to me.
> >>
> >> Hi.
> >> Rebased patches over fresh master.
> >
> > Hi.
> > Rebased patches over fresh master.
>
> Hi.
> Rebased patches over fresh master.

Thank you for your work on this subject.
I have revised the patchset.  I think it would be better if common
infrastructure goes first.  Otherwise we commit async merge append and
immediately revise it.  I also did some minor improvements.

------
Regards,
Alexander Korotkov
Supabase

Вложения

Re: Asynchronous MergeAppend

От
Matheus Alcantara
Дата:
Hi Alexander,

On 29/03/26 22:20, Alexander Korotkov wrote:
> Thank you for your work on this subject.
> I have revised the patchset.  I think it would be better if common
> infrastructure goes first.  Otherwise we commit async merge append and
> immediately revise it.  I also did some minor improvements.
> 

I was thinking about this but did not managed to spent time on it. 
Thanks for re-organizing the patches, it looks better and I think that 
it make more sense on this order.

I also agree with the minor improvements.

Thanks

--
Matheus Alcantara
EDB: https://www.enterprisedb.com



Re: Asynchronous MergeAppend

От
Alexander Korotkov
Дата:
Hi!

On Mon, Mar 30, 2026 at 3:25 PM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:
> On 29/03/26 22:20, Alexander Korotkov wrote:
> > Thank you for your work on this subject.
> > I have revised the patchset.  I think it would be better if common
> > infrastructure goes first.  Otherwise we commit async merge append and
> > immediately revise it.  I also did some minor improvements.
> >
>
> I was thinking about this but did not managed to spent time on it.
> Thanks for re-organizing the patches, it looks better and I think that
> it make more sense on this order.
>
> I also agree with the minor improvements.

I made more work on the patchset.

Patch #1 now considers IncrementalSort as exclusion alongside with
Sort.  Exclusion check is now on the top of the switch().
Patch #2 is split into 3 patches: common structures, common sync
append logic, and common async append logic.
New structs are now named AppendBase/AppendBaseState, corresponding
fields are "ab" and "as".

Most importantly I noted that this patchset actually only makes
initial heap filling asynchronous.  The steady work after that is
still syncnronous.  Even that it used async infrastructure, it fetched
tuples from children subplans one-by-one: effectively synchronous but
paying for asynchronous infrastructure.  I think even with this
limitation, this patchset is valuable: the startup cost for children
foreignscans can be high.  But this understanding allowed me to
significantly simplify the main patch including:
1) After initial heap filling, use ExecProcNode() to fetch from children plans.
2) Remove ms_has_asyncresults entirely. Async responses store directly
into ms_slots[] (the existing heap slot array), which serves as both
the merge state and the "result arrived" indicator via TupIsNull().
3) Removed needrequest usage from MergeAppend. Since MergeAppend only
fires initial requests (via ExecAppendBaseAsyncBegin()) and never
sends follow-up requests, needrequest tracking is unnecessary.
ExecMergeAppendAsyncRequest() was eliminated entirely.
4)  ExecMergeAppendAsyncGetNext() reduced to a simple wait loop:
5)  asyncresults allocation reduced back to nasyncplans.  MergeAppend
doesn't use it (stores in ms_slots), and Append only needs nasyncplans
entries for its stack.

Additionally, I made the following changes.
1) WAIT_EVENT_MERGE_APPEND_READY wait event instead of extending
WAIT_EVENT_APPEND_READY.  That should be less confusing for monitoring
purposes.
2) More tests: error handling with broken partition, plan-time
partition pruning, and run-time partition pruning tests for async
MergeAppend.

I'm going to went through this patchset another time tomorrow and push
it on Monday if there are no objections.

------
Regards,
Alexander Korotkov
Supabase

Вложения

Re: Asynchronous MergeAppend

От
Richard Guo
Дата:
On Sun, Apr 5, 2026 at 11:25 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> I'm going to went through this patchset another time tomorrow and push
> it on Monday if there are no objections.

I completely understand the desire to get this committed ahead of the
feature freeze.  However, I'm concerned that a one-day notice over the
Easter weekend is simply too short for the community to see the
announcement, let alone provide feedback, especially since this is a
pretty big feature.

I don't have any specific technical feedback on the patchset itself,
as I haven't reviewed it.  My only hesitation is the short notice
period.  That said, if you are highly confident in its readiness, I
will defer to your judgment.

- Richard



Re: Asynchronous MergeAppend

От
Etsuro Fujita
Дата:
On Mon, Apr 6, 2026 at 12:40 PM Richard Guo <guofenglinux@gmail.com> wrote:
> On Sun, Apr 5, 2026 at 11:25 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > I'm going to went through this patchset another time tomorrow and push
> > it on Monday if there are no objections.
>
> I completely understand the desire to get this committed ahead of the
> feature freeze.  However, I'm concerned that a one-day notice over the
> Easter weekend is simply too short for the community to see the
> announcement, let alone provide feedback, especially since this is a
> pretty big feature.
>
> I don't have any specific technical feedback on the patchset itself,
> as I haven't reviewed it.  My only hesitation is the short notice
> period.  That said, if you are highly confident in its readiness, I
> will defer to your judgment.

First, my apologies for not having reviewed this patch.  I was
planning to do so, but didn't have time for that, due to other
priorities.

I hate to say this, but as mentioned by Richard, this is a pretty big,
complex feature, so I also think the one-day notice is too short.

Best regards,
Etsuro Fujita



Re: Asynchronous MergeAppend

От
"Matheus Alcantara"
Дата:
On Sat Apr 4, 2026 at 11:24 PM -03, Alexander Korotkov wrote:
> I made more work on the patchset.
>
> Patch #1 now considers IncrementalSort as exclusion alongside with
> Sort.  Exclusion check is now on the top of the switch().
> Patch #2 is split into 3 patches: common structures, common sync
> append logic, and common async append logic.
> New structs are now named AppendBase/AppendBaseState, corresponding
> fields are "ab" and "as".
>
> Most importantly I noted that this patchset actually only makes
> initial heap filling asynchronous.  The steady work after that is
> still syncnronous.  Even that it used async infrastructure, it fetched
> tuples from children subplans one-by-one: effectively synchronous but
> paying for asynchronous infrastructure.  I think even with this
> limitation, this patchset is valuable: the startup cost for children
> foreignscans can be high.  But this understanding allowed me to
> significantly simplify the main patch including:
> 1) After initial heap filling, use ExecProcNode() to fetch from children plans.
> 2) Remove ms_has_asyncresults entirely. Async responses store directly
> into ms_slots[] (the existing heap slot array), which serves as both
> the merge state and the "result arrived" indicator via TupIsNull().
> 3) Removed needrequest usage from MergeAppend. Since MergeAppend only
> fires initial requests (via ExecAppendBaseAsyncBegin()) and never
> sends follow-up requests, needrequest tracking is unnecessary.
> ExecMergeAppendAsyncRequest() was eliminated entirely.
> 4)  ExecMergeAppendAsyncGetNext() reduced to a simple wait loop:
> 5)  asyncresults allocation reduced back to nasyncplans.  MergeAppend
> doesn't use it (stores in ms_slots), and Append only needs nasyncplans
> entries for its stack.
>
> Additionally, I made the following changes.
> 1) WAIT_EVENT_MERGE_APPEND_READY wait event instead of extending
> WAIT_EVENT_APPEND_READY.  That should be less confusing for monitoring
> purposes.
> 2) More tests: error handling with broken partition, plan-time
> partition pruning, and run-time partition pruning tests for async
> MergeAppend.
>

Thanks for v17, the split of 0002 into multiple patches seems much
better. Overall I agree with the changes on v17 compared with v16, the
removal of ms_has_asyncresults makes the code better to read and follow.
The separate WAIT_EVENT_MERGE_APPEND_READY for better monitoring is also
good, I've tested some long running queries and I've find the event on
pg_stat_activity. The steady work changes also looks good.

One minor issue on 0002:

+    bool        valid_subplans_identified;    /* is valid_asyncplans valid? */
+    Bitmapset  *valid_subplans;
+    Bitmapset  *valid_asyncplans;    /* valid asynchronous plans indexes */

Should be /* is valid_subplans valid? */

-----

Minor comment on 0005:

+static void
+ExecMergeAppendAsyncGetNext(MergeAppendState *node, int mplan)
+{
+    /*
+     * All initial async requests were fired by ExecAppendBaseAsyncBegin.

Wondering if we should reference ExecMergeAppendAsyncBegin() instead of
ExecAppendBaseAsyncBegin() since this is on nodeMergeAppend, what do you
think?

--
Matheus Alcantara
EDB: https://www.enterprisedb.com



Re: Asynchronous MergeAppend

От
Alexander Korotkov
Дата:
Hi Matheus,

Thank you for your feedback.

On Tue, Apr 7, 2026 at 2:48 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:
> On Sat Apr 4, 2026 at 11:24 PM -03, Alexander Korotkov wrote:
> > I made more work on the patchset.
> >
> > Patch #1 now considers IncrementalSort as exclusion alongside with
> > Sort.  Exclusion check is now on the top of the switch().
> > Patch #2 is split into 3 patches: common structures, common sync
> > append logic, and common async append logic.
> > New structs are now named AppendBase/AppendBaseState, corresponding
> > fields are "ab" and "as".
> >
> > Most importantly I noted that this patchset actually only makes
> > initial heap filling asynchronous.  The steady work after that is
> > still syncnronous.  Even that it used async infrastructure, it fetched
> > tuples from children subplans one-by-one: effectively synchronous but
> > paying for asynchronous infrastructure.  I think even with this
> > limitation, this patchset is valuable: the startup cost for children
> > foreignscans can be high.  But this understanding allowed me to
> > significantly simplify the main patch including:
> > 1) After initial heap filling, use ExecProcNode() to fetch from children plans.
> > 2) Remove ms_has_asyncresults entirely. Async responses store directly
> > into ms_slots[] (the existing heap slot array), which serves as both
> > the merge state and the "result arrived" indicator via TupIsNull().
> > 3) Removed needrequest usage from MergeAppend. Since MergeAppend only
> > fires initial requests (via ExecAppendBaseAsyncBegin()) and never
> > sends follow-up requests, needrequest tracking is unnecessary.
> > ExecMergeAppendAsyncRequest() was eliminated entirely.
> > 4)  ExecMergeAppendAsyncGetNext() reduced to a simple wait loop:
> > 5)  asyncresults allocation reduced back to nasyncplans.  MergeAppend
> > doesn't use it (stores in ms_slots), and Append only needs nasyncplans
> > entries for its stack.
> >
> > Additionally, I made the following changes.
> > 1) WAIT_EVENT_MERGE_APPEND_READY wait event instead of extending
> > WAIT_EVENT_APPEND_READY.  That should be less confusing for monitoring
> > purposes.
> > 2) More tests: error handling with broken partition, plan-time
> > partition pruning, and run-time partition pruning tests for async
> > MergeAppend.
> >
>
> Thanks for v17, the split of 0002 into multiple patches seems much
> better. Overall I agree with the changes on v17 compared with v16, the
> removal of ms_has_asyncresults makes the code better to read and follow.
> The separate WAIT_EVENT_MERGE_APPEND_READY for better monitoring is also
> good, I've tested some long running queries and I've find the event on
> pg_stat_activity. The steady work changes also looks good.
>
> One minor issue on 0002:
>
> +       bool            valid_subplans_identified;      /* is valid_asyncplans valid? */
> +       Bitmapset  *valid_subplans;
> +       Bitmapset  *valid_asyncplans;   /* valid asynchronous plans indexes */
>
> Should be /* is valid_subplans valid? */

Thank you for catching this, fixed.

> Minor comment on 0005:
>
> +static void
> +ExecMergeAppendAsyncGetNext(MergeAppendState *node, int mplan)
> +{
> +       /*
> +        * All initial async requests were fired by ExecAppendBaseAsyncBegin.
>
> Wondering if we should reference ExecMergeAppendAsyncBegin() instead of
> ExecAppendBaseAsyncBegin() since this is on nodeMergeAppend, what do you
> think?

Technically current comment is correct, because async requests are
essetially fired by ExecAppendBaseAsyncBegin().  But yes, since we're
on nodeMergeAppend, it would be less confusing to mention
ExecMergeAppendAsyncBegin().  Fixed.

------
Regards,
Alexander Korotkov
Supabase

Вложения

Re: Asynchronous MergeAppend

От
Alexander Korotkov
Дата:
Hi Richard,
Hi Etsuro,

On Mon, Apr 6, 2026 at 8:33 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>
> On Mon, Apr 6, 2026 at 12:40 PM Richard Guo <guofenglinux@gmail.com> wrote:
> > On Sun, Apr 5, 2026 at 11:25 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > > I'm going to went through this patchset another time tomorrow and push
> > > it on Monday if there are no objections.
> >
> > I completely understand the desire to get this committed ahead of the
> > feature freeze.  However, I'm concerned that a one-day notice over the
> > Easter weekend is simply too short for the community to see the
> > announcement, let alone provide feedback, especially since this is a
> > pretty big feature.
> >
> > I don't have any specific technical feedback on the patchset itself,
> > as I haven't reviewed it.  My only hesitation is the short notice
> > period.  That said, if you are highly confident in its readiness, I
> > will defer to your judgment.
>
> First, my apologies for not having reviewed this patch.  I was
> planning to do so, but didn't have time for that, due to other
> priorities.
>
> I hate to say this, but as mentioned by Richard, this is a pretty big,
> complex feature, so I also think the one-day notice is too short.

Thank you for your feedback.  I would say that this patch is here for
quite long, and it's pretty straightforward.  It passed many rounds of
review by Matheus Alcantara.  I've done a lot of minor cleanups and
improvements, and reorganized changes into multiple patches.  The only
major change I did is actually a simplification which come from the
fact that only initial heap filling is effectively async [1].  Today
Matheus gave a feedback on my changes.

Surely, I wouldn't commit this patch without giving you a chance to
review.  We can postpone it till early PG20 development cycle.  But if
you find it possible to take a look at this patch during Apr 7, let me
know.

Links.
1. https://www.postgresql.org/message-id/CAPpHfdsO8zYpDW%3D%3DD6T5N0cJ%2BAzK7a_OyXJoYU1kFi%3DxZFTLuQ%40mail.gmail.com
2. https://www.postgresql.org/message-id/DHMH23M7UOFS.12W6PUDI1I3NH%40gmail.com

------
Regards,
Alexander Korotkov
Supabase



Re: Asynchronous MergeAppend

От
Etsuro Fujita
Дата:
On Tue, Apr 7, 2026 at 9:25 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> Thank you for your feedback.  I would say that this patch is here for
> quite long, and it's pretty straightforward.  It passed many rounds of
> review by Matheus Alcantara.  I've done a lot of minor cleanups and
> improvements, and reorganized changes into multiple patches.  The only
> major change I did is actually a simplification which come from the
> fact that only initial heap filling is effectively async [1].  Today
> Matheus gave a feedback on my changes.

I think Matheus did a good job, but he said "I still don't have too
much experience with the executor code but I hope that I can help with
something.", and IIUC, his reviews were mostly about code
cleanup/deduplication, so ISTM that the patch hadn't been reviewed
that extensively, despite its complexity.  That was actually one of
the reasons why I lowered the priority of the patch.

> Surely, I wouldn't commit this patch without giving you a chance to
> review.  We can postpone it till early PG20 development cycle.  But if
> you find it possible to take a look at this patch during Apr 7, let me
> know.

Sorry, I don't have time for that.  I will defer to your judgment, too.

Thank all of you for working on this important feature, anyway!

Best regards,
Etsuro Fujita



Re: Asynchronous MergeAppend

От
"Matheus Alcantara"
Дата:
On Mon Apr 6, 2026 at 9:19 PM -03, Alexander Korotkov wrote:
>> Minor comment on 0005:
>>
>> +static void
>> +ExecMergeAppendAsyncGetNext(MergeAppendState *node, int mplan)
>> +{
>> +       /*
>> +        * All initial async requests were fired by ExecAppendBaseAsyncBegin.
>>
>> Wondering if we should reference ExecMergeAppendAsyncBegin() instead of
>> ExecAppendBaseAsyncBegin() since this is on nodeMergeAppend, what do you
>> think?
>
> Technically current comment is correct, because async requests are
> essetially fired by ExecAppendBaseAsyncBegin().  But yes, since we're
> on nodeMergeAppend, it would be less confusing to mention
> ExecMergeAppendAsyncBegin().  Fixed.
>

Thanks for the updated version.

I did another round of review and testing and it looks good to me, I did
not find any issue or strange behaviour.

With the refactor the async support for MergeAppend is not too
complicated. My main concern on this patch series is about the refactor
itself, if we are missing something or changing some behaviour without
notice, however as the  async/sync Append execution is also using the
refactored code and we have a good code coveraged I fell a bit confident
that we are right here.

--
Matheus Alcantara
EDB: https://www.enterprisedb.com



Re: Asynchronous MergeAppend

От
Alexander Pyhalov
Дата:
Hi.

Looked at it. Overall I agree that when we wait for data from one slot 
after node initialization, we can't get data from other slots - they are 
already either exhausted, or have already received data which is not 
needed for now. So it seems that async machinery on later stages is 
useless.

Was a bit surprised that asyncresults field is not used in async merge 
append anymore, as we save result directly in ms_slots. However, as we 
do it only on initial stage, this seems to be OK.

classify_matching_subplans() comment still refers to ms_valid_subplans.

Will look at it once again tomorrow.
-- 
Best regards,
Alexander Pyhalov,
Postgres Professional



Re: Asynchronous MergeAppend

От
Alexander Pyhalov
Дата:
Alexander Pyhalov писал(а) 2026-04-13 18:09:
> Hi.
> 
> Looked at it. Overall I agree that when we wait for data from one slot 
> after node initialization, we can't get data from other slots - they 
> are already either exhausted, or have already received data which is 
> not needed for now. So it seems that async machinery on later stages is 
> useless.
> 
> Was a bit surprised that asyncresults field is not used in async merge 
> append anymore, as we save result directly in ms_slots. However, as we 
> do it only on initial stage, this seems to be OK.
> 
> classify_matching_subplans() comment still refers to ms_valid_subplans.
> 
> Will look at it once again tomorrow.

Haven't found anything suspicious (besides redundant empty line after 
enable_async_merge_append GUC description in 
src/backend/utils/misc/guc_parameters.dat). Tested it and was a bit 
surprised that new async Merge Append version (without async machinery 
after nodeMergeAppend initialization) was about 5% faster than the old 
one with concurrent queries (-j 16) and 10-15% faster with single-thread 
load (when there was a lot of CPU capacity)  in my tests (test was 
basically the same as in [1]). So, async machinery after Merge Append 
node is initialized was not only useless, but even harmful.

Test results:

patch version  | concurrency  | async_capable off | async_capable on
old            |  -c 1 -j 1   | 880  tps          | 1428 tps  (+62%)
old            |  -c 16 -j 16 | 5190 tps          | 4933 tps  (-5%)
new            |  -c 1 -j 1   | 888  tps          | 1582 tps  (+78%)
new            |  -c 16 -j 16 | 5020 tps          | 5256 tps  (+4%)

1. 
https://www.postgresql.org/message-id/159b591411bb2c81332018927acbd509%40postgrespro.ru
-- 
Best regards,
Alexander Pyhalov,
Postgres Professional