make add_paths_to_append_rel aware of startup cost

Поиск

Список

Период

Сортировка

От	Andy Fan
Тема	make add_paths_to_append_rel aware of startup cost
Дата	6 сентября 2023 г. 15:39:56
Msg-id	CAKU4AWrXSkUV=Pt-gRxQT7EbfUeNssprGyNsB=5mJibFZ6S3ww@mail.gmail.com обсуждение исходный текст
Ответы	Re: make add_paths_to_append_rel aware of startup cost (Andy Fan <zhihui.fan1213@gmail.com>) Re: make add_paths_to_append_rel aware of startup cost (David Rowley <dgrowleyml@gmail.com>)
Список	pgsql-hackers

Дерево обсуждения

Hi:

This thread is a refactor of thread [1] for easier communication.

Currently add_paths_to_append_rel overlooked the startup cost for creating
append path, so it may have lost some optimization chances. After a glance,

the following 4 identifiers can be impacted.

Identifier 1:

SELECT .. FROM v1
UNION ALL
SELECT .. FROM v2
LIMIT 3;

Identifier 2:

SELECT * FROM p .. LIMIT 3; p is a partitioned table.

Identifier 3:
SELECT * FROM p JOIN q using (partkey) LIMIT 3;

If we did the partition-wise-join, then we lost the chances for a better
plan.

Identifier 4: -- EXISTS implies LIMIT 1;
SELECT * FROM foo
WHERE EXISTS
(SELECT 1 FROM append_rel_v_not_pullup_able WHERE xxx);

However, after I completed my patch and wanted to build some real

queries to prove my idea, I just find it is hard to build the case for

Identifier 2/3/4. But the Improvement for Identifier 1 is easy and

my real user case in work is Identifier 1 as well.

So a patch is attached for this case, it will use fractional costs
rather than total costs if needed. The following items needs more
attention during development.

- We shouldn't do the optimization if there are still more tables to join,
the reason is similar to has_multiple_baserels(root) in
set_subquery_pathlist. But the trouble here is we may inherit multiple
levels to build an appendrel, so I have to keep the 'top_relids' all the
time and compare it with PlannerInfo.all_baserels. If they are the same,

then it is the case we want to optimize.

- add_partial_path doesn't consider the startup costs by design, I didn't
rethink too much about this, but the idea of "choose a path which
let each worker produces the top-N tuples as fast as possible" looks
reasonable, and even though add_partial_path doesn't consider startup
cost, it is still possible that childrel keeps more than 1 partial paths
due to any reasons except startup_cost, for example PathKey. then we
still have chances to choose the cheapest fractional path among
them. The same strategy also applies to mixed partial and non-partial
append paths.

- Due to the complexity of add_paths_to_append_rel, 3 arguments have

to be added to get_cheapest_fractional_path...

Path *
get_cheapest_fractional_path(RelOptInfo *rel, double tuple_fraction,
bool allow_parameterized, bool look_partial,
bool must_parallel_safe)

Cases can be improved.

Here is the simplest test case, but it will not be hard to provide more

cases for Identifier 1.

(select * from tenk1 order by hundred)
UNION ALL
(select * from tenk1 order by hundred)
limit 3;

master: 8.096ms.
patched: 0.204ms.

The below user case should be more reasonable for real business.

with a as (select * from t1 join t2..),
b as (select * from t1 join t3 ..)
select * from a union all select * from b
limit 3;

The patch would also have impacts on identifier 2/3/4, even though I can't

make a demo sql which can get benefits from this patch, I also added

some test cases for code coverage purposes.

Any feedback is welcome!

[1] https://www.postgresql.org/message-id/flat/CAKU4AWqEnzhUTxopVhENC3vs6NnYV32+e6GSBtp1rAv0ZNX=mQ@mail.gmail.com

Best Regards

Andy Fan