Re: [PATCH] Erase the distinctClause if the result is unique by definition

Поиск
Список
Период
Сортировка
От Andy Fan
Тема Re: [PATCH] Erase the distinctClause if the result is unique by definition
Дата
Msg-id CAKU4AWrzTHEC7kjsxr5fWzBWmd4hKj6rB1GU0Ut0aXoEDEQOTw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [PATCH] Erase the distinctClause if the result is unique by definition  (David Rowley <dgrowleyml@gmail.com>)
Ответы Re: [PATCH] Erase the distinctClause if the result is unique by definition  (Andy Fan <zhihui.fan1213@gmail.com>)
Список pgsql-hackers


On Fri, Mar 13, 2020 at 11:46 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 13 Mar 2020 at 14:47, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> 1.   for pupulate_baserel_uniquekeys,  we need handle the "pk = Const" as well.
> (relation_has_unqiue_for has a similar logic) currently the following distinct path is still
>  there.

Yeah, I left a comment in propagate_unique_keys_to_joinrel() to
mention that still needs to be done. 
> postgres=# explain select distinct b from t100 where pk = 1;
>                                     QUERY PLAN
> ----------------------------------------------------------------------------------
>  Unique  (cost=8.18..8.19 rows=1 width=4)
>    ->  Sort  (cost=8.18..8.19 rows=1 width=4)
>          Sort Key: b
>          ->  Index Scan using t100_pkey on t100  (cost=0.15..8.17 rows=1 width=4)
>                Index Cond: (pk = 1)
> (5 rows)
>
> I think in this case,  we can add  both (pk) and (b) as the UniquePaths.  If so we
> can get more opportunities to reach our goal.

The UniqueKeySet containing "b" could only be added in the
distinct_rel in the upper planner. It must not change the input_rel
for the distinct.

I think we maintain UniqueKey even without distinct_rel,  so at this stage, 
Can we say b is unique for this(no is possible)?  If yes,  we probably 
need to set that information without consider the distinct clause. 

It's likely best to steer clear of calling UniqueKeys UniquePaths as
it might confuse people.  The term "path" is used in PostgreSQL as a
lightweight representation containing all the information required to
build a plan node in createplan.c. More details in
src/backend/optimizer/README.

 
OK. 
 
> 2.  As for the propagate_unique_keys_to_joinrel,  we can add 1 more UniquePath as
> (rel1_unique_paths,  rel2_unique_paths) if the current rules doesn't apply.
> or else the following cases can't be handled.
>
> postgres=# explain select distinct t100.pkt101.pk from t100, t101;
>                                    QUERY PLAN
> --------------------------------------------------------------------------------
>  Unique  (cost=772674.11..810981.11 rows=5107600 width=8)
>    ->  Sort  (cost=772674.11..785443.11 rows=5107600 width=8)
>          Sort Key: t100.pk, t101.pk
>          ->  Nested Loop  (cost=0.00..63915.85 rows=5107600 width=8)
>                ->  Seq Scan on t100  (cost=0.00..32.60 rows=2260 width=4)
>                ->  Materialize  (cost=0.00..43.90 rows=2260 width=4)
>                      ->  Seq Scan on t101  (cost=0.00..32.60 rows=2260 width=4)
> (7 rows)

I don't really follow what you mean here. It seems to me there's no
way we can skip doing DISTINCT in the case above.  If you've just
missed out the join clause and you meant to have "WHERE t100.pk =
t101.pk", then we can likely fix that later with some sort of
functional dependency tracking.

In the above case the result should be unique, the knowledge behind that
is if *we join 2 unique results in any join method, the result is unique as well*
in the above example,  the final unique Key is (t100.pk, t101.pk).   
 
Likely we can just add a Relids field
to UniqueKeySet to track which relids are functionally dependant on a
row from the UniqueKeySet's uk_exprs.  That might be as simple as just
pull_varnos from the non-matched exprs and checking to ensure the
result is a subset of functionally dependant rels.  I'd need to give
that more thought.

Was this a case you had working in your patch?

I think we can do that after I get your UniqueKey idea,  so, no,  my previous patch is not as smart
as yours:)  


> But if we add such rule,  the unique paths probably become much longer,  so we need
> a strategy to tell if the UniquePath is useful for our query,  if not,  we can ignore that.
> rel->reltarget maybe a good info for such optimization.   I think we can take this into
> consideration for pupulate_baserel_uniquekeys as well.

I don't really think the number of unique indexes in a base rel will
really ever get out of hand for legitimate cases.
propagate_unique_keys_to_joinrel is just concatenating baserel
UniqueKeySets to the joinrel. They're not copied, so it's just tagging
pointers onto the end of an array, which is at best a memcpy(), or at
worst a realloc() then memcpy(). That's not so costly.

The memcpy is not key concerns here.  My main point is we need
to focus on the length of RelOptInfo->uniquekeys.  For example:  
t  has 3 uk like this (uk1),  (uk2), (uk3).  And the query is
select b from t where m = 1;  If so there is no need to add these 3
to UniqueKeys so that we can keep rel->uniquekeys shorter.   

The the length of rel->uniquekeys maybe a  concern if we add the rule 
I suggested above , the (t100.pk, t101.pk) case.   Think about this
for example:

1. select .. from t1, t2, t3, t4...;  
2. suppose each table has 2 UniqueKeys, named (t{m}_uk{n}) 
3. follow my above rule (t1.pk1, t2.pk)is a UniqueKey for joinrel. 
4. suppose we join with the following order (t1 vs t2 vs t3 vs t4)

For for (t1 vs t2),  we need to add 4 more UniqueKeys for this joinrel. 
(t1_uk1 , t2_uk1), (t1_uk1, t2_uk2), (t1_uk2 , t2_uk1), (t1_uk2, t2_uk2)

After we come to join the last one, the joinrel->uniquekey will be much longer
which makes the scan of it less efficient. 

But this will not be an issue if my above rule should not be considered.  so 
we need to talk about that first. 

> For the non_null info,  Tom suggested to add maintain such info RelOptInfo,
> I have done that for the not_null_info for basic relation catalog,  I think we can
> maintain the same flag for joinrel and the not null info from find_nonnullable_vars as
> well, but I still didn't find a good place to add that so far.

I'd considered just adding a get_notnull() function to lsyscache.c.
Just below get_attname() looks like a good spot.  I imagined just
setting the bit in the UniqueKeySet's non_null_keys field
corresponding to the column position from the index.  I could see the
benefit of having a field in RelOptInfo if there was some way to
determine the not-null properties of all columns in the table at once,

do you mean get the non-null  properties from catalog or restrictinfo? 
if you mean catalog,  get_relation_info may be a good place for that. 

but there's not, so we're likely best just looking at the ones that
there are unique indexes on.
 
> A small question about the following code:
>
> +       if (relation_has_uniquekeys_for(root, input_rel, get_sortgrouplist_exprs(parse->distinctClause, parse->targetList), false))
> +       {
> +
> +               add_path(distinct_rel, (Path *)cheapest_input_path);
> +
> +               /* XXX yeah yeah, need to call the hooks etc. */
> +
> +               /* Now choose the best path(s) */
> +               set_cheapest(distinct_rel);
> +
> +               return distinct_rel;
> +       }
>
> Since we don't create new RelOptInfo/Path,  do we need to call add_path and set_cheapest?

The distinct_rel already exists. add_path() is the standard way we
have of adding paths to the rel's pathlist.  Why would you want to
bypass that? set_cheapest() is our standard way of looking at the
pathlist and figuring out the least costly one. It's not a very hard
job to do when there's just 1 path. Not sure why you'd want to do it
another way.

I got the point now.  In this case,  you create an new RelOptInfo  
named distinct_rel, so we *must* set it.  Can we just return the input_rel 
in this case? if we can,  we don't need that. 

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Thomas Munro
Дата:
Сообщение: Re: The flinfo->fn_extra question, from me this time.
Следующее
От: "Andrey M. Borodin"
Дата:
Сообщение: Re: pglz performance