Thread: GEQO plans much slower than standard join plans


GEQO plans much slower than standard join plans

From: Carlo Sganzerla
Date:
Hello experts,

Recently, we've been encountering poor query plans in our database workload. We diagnosed the cause as queries hitting the default join_collapse_limit of 8.

We've already dealt with the situation successfully. However, I've been discussing with some colleagues the possibility of raising the join_collapse_limit parameter value (and from_collapse_limit with it) in our production environment. I was trying to determine what would be a sensible new value when I stumbled upon this thread on OLTP Star Joins: https://www.postgresql.org/message-id/1ea167aa-457d-422a-8422-b025bb660ef3%40vondra.me

Our data model is hierarchical, so we have chains of child tables rather than the dimension tables mentioned above. I created a script similar to the one found in the OLTP Star Join thread to test the effects of different join_collapse_limit values locally on a hierarchical model closer to ours. I ran pgbench with 30 threads and 30 clients for 5 seconds. The following results were obtained on my machine (AMD Ryzen 7 5700U with 8 cores and 32 GB RAM, PostgreSQL 17.5 (Debian 17.5-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit):

OLTP starjoin (10 dimensions)
- join_collapse_limit = 1: 6.4k TPS
- join_collapse_limit = 8: 2.8k TPS
- join_collapse_limit = 11: 400 TPS

Tree join (10 levels)
- join_collapse_limit = 1:  7.4k TPS
- join_collapse_limit = 8: 4.7k TPS
- join_collapse_limit = 11:  3.9k TPS

Tree join (14 levels)
- join_collapse_limit = 1:  5.1k TPS
- join_collapse_limit = 8:  3.4k TPS
- join_collapse_limit = 11:  3k TPS
- join_collapse_limit = 14, geqo = off:  2.2k TPS
- join_collapse_limit = 14, geqo = on: 200 TPS

Note: geqo_threshold was unchanged from the default (12)
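For reference, the per-run planner settings above can be reproduced in a psql session along these lines (a sketch; geqo_threshold is left at its default, and the values are one of the configurations tested):

```sql
-- Per-session planner settings for one of the runs above
SET join_collapse_limit = 11;
SET from_collapse_limit = 11;   -- raised together with join_collapse_limit
SET geqo = on;                  -- GEQO engages once the item count reaches geqo_threshold
SHOW geqo_threshold;            -- default is 12
```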

I assume that the reason why the hierarchical "tree join" is much faster is due to the dependencies among tables, so the standard join search has a much narrower range of possible query paths compared to the OLTP Star Join case. What surprised me, however, is that when GEQO is turned on, the TPS falls dramatically. Given that the documentation states that GEQO "... reduces planning time for complex queries (those joining many relations), at the cost of producing plans that are sometimes inferior to those found by the normal exhaustive-search algorithm", it made me wonder what could be the cause of this much slower planning. I'm not really familiar with genetic algorithms, so perhaps I might be missing something, but is this kind of planning performance hit normal when GEQO is on? I was hoping someone could help us on this topic.

Regards,
Carlo
Attachments

Re: GEQO plans much slower than standard join plans

From: Tomas Vondra
Date:
On 10/27/25 19:17, Carlo Sganzerla wrote:
> 
> I assume that the reason why the hierarchical "tree join" is much faster
> is due to the dependencies among tables, so the standard join search has
> a much narrower range of possible query paths compared to the OLTP Star
> Join case. What surprised me, however, is that when GEQO is turned on,
> the TPS falls dramatically. Given that the documentation states that
> GEQO "... reduces planning time for complex queries (those joining many
> relations), at the cost of producing plans that are sometimes inferior
> to those found by the normal exhaustive-search algorithm", it made me
> wonder what could be the cause of this much slower planning. I'm not
> really familiar with genetic algorithms, so perhaps I might be missing
> something, but is this kind of planning performance hit normal when GEQO
> is on? I was hoping someone could help us on this topic.

I'm not particularly familiar with the GEQO internals, so I can't point
at specific issues. But I've heard from a couple of experienced
developers that they consider GEQO ineffective / not the right approach.

However, I don't think you need detailed knowledge of the GEQO internals
to explain the observed behavior.

The regular (non-GEQO) planning explores more or less all possible join
orders - we don't construct all of them, thanks to dynamic programming,
but we only skip orders that we evaluate as not interesting. That's not
100% true, though, due to join_collapse_limit (which splits the problem
into smaller subproblems, and limits which part of the search space we
really explore).

The idea of GEQO is that it reduces the search space even more, and it
explores only a very small fraction of join orders. The idea was to do
that in a smart way to still produce "quality" join orders, but the
experience is it's not reliable.

It shouldn't be difficult to demonstrate this using EXPLAIN. If you look
at the plans with/without GEQO, I'd bet the geqo=on plans will have a
much higher cost (which would show that GEQO fails to find many of the
"good" plans). Of course, the execution might still be fast.
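Concretely, such a comparison might look like this (a sketch; the query is whichever affected test query is being examined):

```sql
-- Compare the planner's estimated cost with and without GEQO
SET join_collapse_limit = 14;

SET geqo = off;
EXPLAIN SELECT ...;   -- note the top-level "cost=..." of the chosen plan

SET geqo = on;
EXPLAIN SELECT ...;   -- a much higher top-level cost would indicate GEQO
                      -- failed to find the good join orders
```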

Another question is whether the difference is in planning or execution.
I'd expect geqo=on makes planning faster and execution slower, but maybe
that's not true for your test. It shouldn't be difficult to verify using
pg_stat_statements (which tracks both plan and exec time).
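A query along these lines could separate the two (a sketch, assuming the pg_stat_statements extension is installed and pg_stat_statements.track_planning = on, which is off by default):

```sql
-- Top statements by average planning time, with execution time alongside
SELECT left(query, 60) AS query,
       calls,
       mean_plan_time,   -- ms; requires pg_stat_statements.track_planning = on
       mean_exec_time    -- ms
FROM pg_stat_statements
ORDER BY mean_plan_time DESC
LIMIT 10;
```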


regards

-- 
Tomas Vondra




Re: GEQO plans much slower than standard join plans

From: Carlo Sganzerla
Date:
> Another question is whether the difference is in planning or execution.
> I'd expect geqo=on makes planning faster and execution slower, but maybe
> that's not true for your test. It shouldn't be difficult to verify using
> pg_stat_statements (which tracks both plan and exec time).

We started experimenting and we already see some results suggesting that we have not only better plans, but also faster plans. Our overall planning load was already somewhat low because of our extensive use of prepared statements, which in theory also reduces the cost of not enabling GEQO. However, we tested a bit without prepared statements and saw results similar to those from the test.

I'm not sure if I'm overstepping here, but wouldn't it be worthwhile to add a little disclaimer in the docs about such cases? I guess the hard thing about writing database documentation is that you have to write generically enough that the information makes sense for most reasonable workloads, so documenting every exception is not a good idea; but in this case I feel the docs give too strong an impression that GEQO *always* improves planning performance. I was thinking of adding a little "addendum"; I've attached a patch for your consideration.

> I'm not particularly familiar with the GEQO internals, so I can't point
> at specific issues. But I've heard from a couple experienced developers
> that they consider GEQO ineffective / not the right approach.

In my view, this "addendum" also leaves some room for these other considerations without compromising readability.

I'd like to hear your thoughts on that.

Regards,
Carlo


Attachments

Re: GEQO plans much slower than standard join plans

From: Tomas Vondra
Date:
On 10/28/25 16:43, Carlo Sganzerla wrote:
>> Another question is whether the difference is in planning or execution.
>> I'd expect geqo=on makes planning faster and execution slower, but maybe
>> that's not true for your test. It shouldn't be difficult to verify using
>> pg_stat_statements (which tracks both plan and exec time).
> 
> We started experimenting and we already see some results that point out
> that we have not only better plans, but also faster plans. Our overall
> plan load was already somewhat low because of extensive use of prepared
> statements, so that in theory also reduces the load of not enabling
> GEQO. However, we tested a bit without prepared statements and had
> similar results as we did on the test.
> 

I'm not entirely sure I follow what tests you did, some of which might
not have been shared on the list. Also, what do you mean by "better
plans, but also faster plans"? What's the difference?

It seems to me you're implying you get faster planning with geqo=off.
That seems counter-intuitive to me, but maybe it can happen. However, I
don't see any example demonstrating that in the results you shared (and
I already suggested what data to look at).

> I'm not sure if I'm overstepping here, but wouldn't it be worthwhile to
> add a little disclaimer on the docs regarding such cases? I guess that
> the hard thing about elaborating database documentation is that you have
> to document generically enough so the information makes sense to most
> reasonable workloads, so documenting every exception is not a good idea,
> but on this case I feel that the docs gave too much the impression that
> GEQO *always* improves planning performance. I was thinking of adding a
> little "addendum". I've attached a patch for your consideration.
> 
>> I'm not particularly familiar with the GEQO internals, so I can't point
>> at specific issues. But I've heard from a couple experienced developers
>> that they consider GEQO ineffective / not the right approach.
> 
> In my view, this "addendum" also leaves some room to these other
> considerations without compromising readability.
> 
> I'd like to hear your thoughts on that.
> 

Dunno. I'm not convinced geqo=on can increase planning time. Or maybe I
don't understand the results you've shared.


regards

-- 
Tomas Vondra




Re: GEQO plans much slower than standard join plans

From: Carlo Sganzerla
Date:
> I'm not entirely sure I follow what tests you did, some of which might
> not have been shared on the list. Also, what do you mean by "better
> plans, but also faster plans"? What's the difference?

Sorry, I should have been clearer. By "better plans" I meant faster execution times, and by "faster plans" I meant faster planning times.

> It seems to me you're implying you get faster planning with geqo=off.
> That seem counter-intuitive to me, but maybe it can happen. I however
> don't see any example demonstrating that in the results you shared (and
> I already suggested what data to look at).

Yes, I also found that counterintuitive. By turning geqo = off (join/from_collapse_limit were 14) and looking at some metrics we obtained from pg_stat_statements, we found that planning times of the affected queries (the ones that would make use of GEQO) mostly increased (but not by much), which is the documented behavior. However, one particular query with the same structure as the one in treejoin-14-dims.sql (attached to the first email) had a _decrease_ in planning and execution times, so geqo = off for this query yielded faster planning than geqo = on. This counterintuitive behavior is why I created and shared the script that tries to reproduce it. I'm not sure whether you've run it, but on every machine I've tried so far I see a 10-fold increase in TPS and a 10-fold decrease in planning times when I turn geqo off.

I guess what I'm trying to show is that, in very particular cases, GEQO may harm planning times; that's why I felt a little disclaimer could be useful. I'm not sure what else I can add here.

Regards,
Carlo  



Re: GEQO plans much slower than standard join plans

From: David Rowley
Date:
On Thu, 30 Oct 2025 at 03:57, Carlo Sganzerla <carlo@alude.com.br> wrote:
> Yes, I also found that counterintuitive. By turning geqo = off
> (join/from_collapse_limit were 14) and looking at some metrics we
> obtained from pg_stat_statements, we found that planning times of the
> affected queries (the ones that would make use of GEQO) mostly
> increased (but not by much), which is the documented behavior.
> However, one particular query with the same structure as the one in
> treejoin-14-dims.sql (attached to the first email) had a _decrease_
> in planning and execution times, so geqo = off for this query yielded
> faster planning than geqo = on. This counterintuitive behavior is why
> I created and shared the script that tries to reproduce it. I'm not
> sure whether you've run it, but on every machine I've tried so far I
> see a 10-fold increase in TPS and a 10-fold decrease in planning
> times when I turn geqo off.
>
> I guess what I'm trying to show is that, in very particular cases,
> GEQO may harm planning times; that's why I felt a little disclaimer
> could be useful. I'm not sure what else I can add here.

I think what's happening here is that, in the query as you've written
it, there isn't much flexibility in how the tables are joined. There
are 14 EquivalenceClasses in the 14-table case. "f" cannot be joined to
"c13" without "c12", and joining "c12" requires "c11", and so on. For
queries that join all on the same column, there'd be fewer
EquivalenceClasses and more flexibility in the join order.

Doing:

\set id random(1,10000)

set join_collapse_limit = 14;
set geqo = off;

explain (summary on) select * from f
    join c0 on (c0.id = f.id)
    join c1 on (c1.id = c0.id)
    join c2 on (c2.id = c1.id)
    join c3 on (c3.id = c2.id)
    join c4 on (c4.id = c3.id)
    join c5 on (c5.id = c4.id)
    join c6 on (c6.id = c5.id)
    join c7 on (c7.id = c6.id)
    join c8 on (c8.id = c7.id)
    join c9 on (c9.id = c8.id)
    join c10 on (c10.id = c9.id)
    join c11 on (c11.id = c10.id)
    join c12 on (c12.id = c11.id)
    join c13 on (c13.id = c12.id)
where f.id = :id;

Is much slower to plan, as it opens up the possibilities for the join
order: there's a single EquivalenceClass with 14 members and any
combination is possible.

I've not studied the GEQO code nearly as much as the standard join
search code, but what it seems to do is try random combinations. For
the case you demonstrated in [1], it discovers a large number of pairs
that cannot be joined as there's no join condition between them. I
suspect that's the issue here. The standard search manages to just
incrementally build up towards the final join, as it processes the
tables in the same order as they appear in the query and never hits
the case of not finding a join condition between the pairs it tries.

Maybe there's some way to make GEQO more efficient by precalculating
which pairs might be good ones to try. That could likely be done
fairly efficiently now that we have RelOptInfo.eclass_indexes, or at
least it could help identify pairs that there's no point in trying.
Maybe someone wants to study this more than I have and see what might
be possible. I've not quite got the motivation, at least not today.

David

[1] https://www.postgresql.org/message-id/attachment/183634/treejoin-14-dims.sql