Re: Is there value in having optimizer stats for joins/foreignkeys?
| От | Alexandra Wang |
|---|---|
| Тема | Re: Is there value in having optimizer stats for joins/foreignkeys? |
| Дата | |
| Msg-id | CAK98qZ2mW=geT9NKe5vC68-sB9EJe_887uV=MCFt6y9AhyTp7A@mail.gmail.com обсуждение исходный текст |
| Ответ на | Re: Is there value in having optimizer stats for joins/foreignkeys? (Corey Huinker <corey.huinker@gmail.com>) |
| Список | pgsql-hackers |
Hi there,
On Mon, Dec 1, 2025 at 7:16 PM Corey Huinker <corey.huinker@gmail.com> wrote:
On Mon, Dec 1, 2025 at 1:02 PM Tomas Vondra <tomas@vondra.me> wrote:
I think adding joins to extended stats would not be all that hard
(famous last words, I know). For me the main challenge was figuring out
how to store the join definition in the catalog, I always procrastinated
and never gave that a serious try.I envisioned keying the stats off the foreign key constraint id, or adding "starefrelid" (relation oid of the referencing table) to pg_statistic or a table roughly the same shape as pg_statistic.
On Mon, Dec 1, 2025 at 1:02 PM Tomas Vondra <tomas@vondra.me> wrote:
FWIW I think we might start by actually using per-table extended stats
on the joined tables. Just like we combine the scalar MCVs on joined
columns, we could combine multicolumn MVCs.That's the other half of this - if the stats existed, do we have an obvious way to put them to use?
I have indeed started by implementing MCV statistics for joins,
because I have not found a case for joins that would benefit only from
ndistinct or functional dependency stats that MCV stats wouldn't help.
because I have not found a case for joins that would benefit only from
ndistinct or functional dependency stats that MCV stats wouldn't help.
In my POC patch, I've made the following catalog changes:
- Add stxotherrel (oid) and stxjoinkeys (int2vector) fields to pg_statistic_ext- Use the existing stxkeys (int2vector) to store the stats object attributes of stxotherrel
- Create pg_statistic_ext_otherrel_index on (stxrelid, stxotherrel)
- Add stxdjoinmcv (pg_join_mcv_list) to pg_statistic_ext_data
To use them, we can let the planner detect patterns like this:
/*
* JoinStatsMatch - Information about a detected join pattern
* Used internally to track what was matched in a join+filter pattern
*/
typedef struct JoinStatsMatch
{
Oid target_rel; /* table OID of the estimation target */
AttrNumber targetrel_joinkey; /* target_rel's join column */
Oid other_rel; /* table OID of the filter source */
AttrNumber otherrel_joinkey; /* other_rel's join column */
List *filter_attnums; /* list of AttrNumbers for filter columns in other_rel */
List *filter_values; /* list of Datum constant values being filtered */
Oid collation; /* collation for comparisons */
/* Additional info to avoid duplicate work */
List *join_rinfos; /* list of join clause RestrictInfos */
RestrictInfo *filter_rinfo; /* the filter clause RestrictInfo */
} JoinStatsMatch;
and add the detection logic in clauselist_selectivity_ext() and get_foreign_key_join_selectivity().
Statistics collection indeed needs the most thinking. For the
purpose of a POC, I added MCV join stats collection as part of ANALYZE
of one table (stxrel in pg_statistic_ext). I can do this because MCV
join stats are somewhat asymmetric. It allows me to have a target
table (referencing table for foreign key join) to ANALYZE, and we can
use the already collected MCVs of the joinkey column on the target
table to query the rows in the other table. This greatly mitigates
performance impact compared to actually joining two tables. However,
if we are to support more complex joins or other types of join stats
such as ndistinct or functional dependency, I found it hard to define
who's the target table (referencing table) and who's the other table
(referenced table) outside of the foreign key join scenario. So I
think for those more complex cases eventually we may as well
perform the joins and collect the join stats separately. Alvaro
Herrera suggested offline that we could have a dedicated autovacuum
command option for collecting the join statistics.
purpose of a POC, I added MCV join stats collection as part of ANALYZE
of one table (stxrel in pg_statistic_ext). I can do this because MCV
join stats are somewhat asymmetric. It allows me to have a target
table (referencing table for foreign key join) to ANALYZE, and we can
use the already collected MCVs of the joinkey column on the target
table to query the rows in the other table. This greatly mitigates
performance impact compared to actually joining two tables. However,
if we are to support more complex joins or other types of join stats
such as ndistinct or functional dependency, I found it hard to define
who's the target table (referencing table) and who's the other table
(referenced table) outside of the foreign key join scenario. So I
think for those more complex cases eventually we may as well
perform the joins and collect the join stats separately. Alvaro
Herrera suggested offline that we could have a dedicated autovacuum
command option for collecting the join statistics.
I have experimented with two ways to define the join statistics:
1. Use CREATE STATISTICS:
CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ] [ ( mcv ) ] ON { table_name1.column_name1 }, { table_name1.column_name2 } [, ...] FROM table_name1 JOIN table_name2 ON table_name1.column_name3 = table_name2.column_name4
Examples:
-- Create join MCV statistics on a single filter column (keyword)CREATE STATISTICS movie_keyword_keyword_join_stats (mcv)
ON k.keyword
FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;
-- Create join MCV statistics on multiple filter columns (keyword + phonetic_code):
CREATE STATISTICS movie_keyword_keyword_multicols_join_stats (mcv)
ON k.keyword, k.phonetic_code
FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;
2. Auto join stats creation for Foreign Key + Functional Dependency stats
Initially, I did not implement the CREATE TABLE STATISTICS command to
create the join stats. Instead, I’ve implemented logic in ANALYZE to
detect functional dependency stats on the referenced table through FKs
and create join statistics implicitly for those cases.
create the join stats. Instead, I’ve implemented logic in ANALYZE to
detect functional dependency stats on the referenced table through FKs
and create join statistics implicitly for those cases.
I've been using the Join Order Benchmark (JOB) [1] to measure
performance gain. I will post the POC patch and performance numbers in
a followup email.
performance gain. I will post the POC patch and performance numbers in
a followup email.
Best,
Alex
Alexandra Wang
В списке pgsql-hackers по дате отправления: