Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables

Поиск
Список
Период
Сортировка
От Ashutosh Bapat
Тема Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
Дата
Msg-id CAFjFpReEb3MQ3nobZW49vzZz_sYjCeiw+pFDdyNnpXG+StNWCw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-hackers
On Wed, Oct 4, 2017 at 9:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Oct 3, 2017 at 3:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I decided to skip over 0001 for today and spend some time looking at
>> 0002-0006.
>
> Back to 0001.
>
> +        Enables or disables the query planner's use of partition-wise join
> +        plans. When enabled, it spends time in creating paths for joins between
> +        partitions and consumes memory to construct expression nodes to be used
> +        for those joins, even if partition-wise join does not result in the
> +        cheapest path. The time and memory increase exponentially with the
> +        number of partitioned tables being joined and they increase linearly
> +        with the number of partitions. The default is <literal>off</>.
>
> I think this is too scary and too much technical detail.  I think you
> could just say something like: Enables or disables use of
> partition-wise join, which allows a join between partitioned tables to
> be performed by joining the matching partitions.  Partition-wise join
> currently applies only when the join conditions include all the
> columns of the partition keys, which must be of the same data type and
> have exactly matching sets of child partitions.  Because
> partition-wise join planning can use significantly increase CPU time
> and memory usage during planning, the default is <literal>off</>.

Done. With slight change. "include all the columns of the partition
keys" has a different meaning when partition key is an expression, so
I have used "include all the partition keys". Also changed the last
sentence as "... can use significantly more CPU time and memory during
planning ...". Please feel free to revert those changes, if you don't
like them.

>
> +partitioned table. The join partners can not be found in other partitions. This
> +condition allows the join between partitioned tables to be broken into joins
> +between the matching partitions. The resultant join is partitioned in the same
>
> "The join partners can not be found in other partitions." is redundant
> with the previous sentence.  I suggest deleting it.  I also suggest
> "This condition allows the join between partitioned tables to be
> broken" -> "Because of this, the join between partitioned tables can
> be broken".

Done.

>
> +relation" for both partitioned table as well as join between partitioned tables
> +which can use partition-wise join technique.
>
> for either a partitioned table or a join between compatibly partitioned tables

Done.

>
> +Partitioning properties of a partitioned relation are stored in
> +PartitionSchemeData structure. Planner maintains a list of canonical partition
> +schemes (distinct PartitionSchemeData objects) so that any two partitioned
> +relations with same partitioning scheme share the same PartitionSchemeData
> +object. This reduces memory consumed by PartitionSchemeData objects and makes
> +it easy to compare the partition schemes of joining relations.
>
> Not all of the partitioning properties are stored in the
> PartitionSchemeData structure any more.  I think this needs some
> rethinking and maybe some expansion.  As written, each of the first
> two sentences needs a "the" at the beginning.

Changed to

The partitioning properties of a partitioned relation are stored in its
RelOptInfo.  The information about data types of partition keys are stored in
PartitionSchemeData structure. The planner maintains a list of canonical
partition schemes (distinct PartitionSchemeData objects) so that RelOptInfo of
any two partitioned relations with same partitioning scheme point to the same
PartitionSchemeData object.  This reduces memory consumed by
PartitionSchemeData objects and makes it easy to compare the partition schemes
of joining relations.

Let me know if this looks good.

>
> +                               /*
> +                                * Create "append" paths for
> partitioned joins. Do this before
> +                                * creating GatherPaths so that
> partial "append" paths in
> +                                * partitioned joins will be considered.
> +                                */
>
> I think you could shorten this to a single-line comment and just keep
> the first sentence.  Similarly in the other location where you have
> the same sort of thing.

Done.

>
> + * child-joins. Otherwise, add_path might delete a path that some "append"
> + * path has reference to.
>
> to which some path generated here has a reference.

Done.

>
> Here and elsewhere, you use "append" rather than Append to refer to
> the paths added.  I suppose that's weasel-wording to work around the
> fact that they might be either Append or MergeAppend paths, but I'm
> not sure it's really going to convey that to anyone.  I suggest
> rephrasing those comments more generically, e.g.:
>
> +       /* Add "append" paths containing paths from child-joins. */
>
> You could say: Build additional paths for this rel from child-join paths.
>
> Or something.

Done. Removed word "append" from the comments in merge_clump(),
standard_join_search() and prologue of
generate_partition_wise_join_paths(). Changed the last comment as per
your suggestion.

>
> +       if (!REL_HAS_ALL_PART_PROPS(rel))
> +               return;
> Isn't this an unnecessarily expensive test?  I mean, it shouldn't be
> possible for it to have some arbitrary subset.

All this function cares about is whether the given relation has any
partitions which can be simply checked by rel->nparts > 0 and
rel->part_rels != NULL. We need to explicitly check part_rels because
an outer join which has empty inner side in every pair will have
part_scheme, partbounds, nparts all set, but not part_rels. See
relevant comments in try_partition_wise_join() for more details. I
have now replaced macro with checks on rel->nparts and rel->part_rels.
This would change with the last patch dealing with partition-wise join
involving dummy relations. Once we have that an outer join like above
will also have part_rels set. But even then I think checking for
part_rels and nparts makes more sense than part_scheme and partbounds.

>
> +       /*
> +        * Every pair of joining relations we see here should have an equi-join
> +        * between partition keys if this join has been deemed as a partitioned
> +        * join. See build_joinrel_partition_info() for reasons.
> +        */
> +       Assert(have_partkey_equi_join(rel1, rel2, parent_sjinfo->jointype,
> +
> parent_restrictlist));
>
> I suggest removing this assertion.  Seems like overkill to me.

I thought it was good to have there to catch any bug breaking that
rule. But I have removed it as per your suggestion.
Do you think we should remove following assertions as well?   /*    * Since we allow partition-wise join only when the
partitionbounds of    * the joining relations exactly match, the partition bounds of the join    * should match those
ofthe joining relations.    */   Assert(partition_bounds_equal(joinrel->part_scheme->partnatts,
       joinrel->part_scheme->parttyplen,                                 joinrel->part_scheme->parttypbyval,
                    joinrel->boundinfo, rel1->boundinfo));
Assert(partition_bounds_equal(joinrel->part_scheme->partnatts,
joinrel->part_scheme->parttyplen,                                joinrel->part_scheme->parttypbyval,
            joinrel->boundinfo, rel2->boundinfo));
 

>
> +               child_sjinfo = build_child_join_sjinfo(root, parent_sjinfo,
> +
>                     child_rel1->relids,
> +
>                     child_rel2->relids);
>
> It seems like we might end up doing this multiple times for the same
> child join, if there are more than 2 tables involved.  Not sure if
> there's a good way to avoid that.

IIUC every pair of joining relations will use a different sjinfo,  A
LEFT JOIN B LEFT JOIN C will have two sjinfos one for AB and other for
BC. For ABC we will use the one for AB to join A with BC and we will
use one for BC to join AB with C. I agree that we are building sjinfo
for AB twice once for joining AB and then for A(BC). In order to avoid
that we will have to somehow link the parent sjinfo with child sjinfo
and avoid translating parent sjinfo again and again. May be the parent
sjinfo can contain a cache of child sjinfos.Do we want to do that in
this patch set? We could avoid translation entirely, if we could use
parent sjinfo for joining children. But that's a pretty deep surgery.

> Similarly for child_restrictlist.

Similary for restrictlist. Every joining pair has a different
restrictlist. Otherwise, we would have saved restrictlist in the
joinrel itself.

>
> +       pk_has_clause = (bool *) palloc0(sizeof(bool) * num_pks);
>
> Just  do bool pk_has_clause[PARTITION_MAX_KEYS] instead.  Stack
> allocation is a lot faster, and then you don't need to pfree it.

That's a good idea. Done.

>
> +       /* Remove the relabel decoration. */
>
> the -> any, decoration -> decorations

Done.

>
> +       /*
> +        * Replace the Var nodes of parent with those of children in
> expressions.
> +        * This function may be called within a temporary context, but the
> +        * expressions will be shallow-copied into the plan. Hence copy those in
> +        * the planner's context.
> +        */
>
> I can't make heads or tails of this comment.

haha! My bad. the second sentence is something left of the code where
the child-joins used to be planned in a temporary memory context.
That's not true any more. Removed the entire comment.

>
> --- a/src/backend/optimizer/util/pathnode.c
> +++ b/src/backend/optimizer/util/pathnode.c
> @@ -23,7 +23,9 @@
>  #include "optimizer/pathnode.h"
>  #include "optimizer/paths.h"
>  #include "optimizer/planmain.h"
> +#include "optimizer/prep.h"
>  #include "optimizer/restrictinfo.h"
> +#include "optimizer/tlist.h"
>  #include "optimizer/var.h"
>  #include "parser/parsetree.h"
>  #include "utils/lsyscache.h"
>
> Maybe not needed?  This is the only hunk in this file?  Or should this
> be part of one of the later patches?

I think 0005. Sorry. I will move it there.

>
> +       Assert(IS_JOIN_REL(childrel) && IS_JOIN_REL(parentrel));
> +
> +       /* Ensure child relation is really what it claims to be. */
> +       Assert(IS_OTHER_REL(childrel));
>
> I suggest tightening this up a bit by removing the comment and the
> blank line that precedes it.

Done.

>
> +       foreach(lc, parentrel->reltarget->exprs)
> +       {
> +               PlaceHolderVar *phv = lfirst(lc);
> +
> +               if (IsA(phv, PlaceHolderVar))
> +               {
> +                       /*
> +                        * In case the placeholder Var refers to any
> of the parent
> +                        * relations, translate it to refer to the
> corresponding child.
> +                        */
> +                       if (bms_overlap(phv->phrels, parentrel->relids) &&
> +                               childrel->reloptkind == RELOPT_OTHER_JOINREL)
> +                       {
> +                               phv = (PlaceHolderVar *)
> adjust_appendrel_attrs(root,
> +
>                                                          (Node *) phv,
> +
>                                                          nappinfos,
> +
>                                                          appinfos);
> +                       }
> +
> +                       childrel->reltarget->exprs =
> lappend(childrel->reltarget->exprs,
> +
>                           phv);
> +                       phv_added = true;
> +               }
> +       }
>
> What if the PHV is buried down inside the expression someplace rather
> than being at the top level?

That can't happen. See add_placeholders_to_joinrel(), which adds these
placeholders to joinrel's target. That function adds PHVs as bare
nodes, not embedded into something else.

>  More generally, why are we not just
> applying adjust_appendrel_attrs() to the whole expression?

Usually targetlists of join have Var nodes which bubble up from the
base relations. Even PHVs bubble up from the lowest join where they
can be evaluated. If we translate reltarget, we will allocate new Var
nodes for every join relation consuming more memory and then setrefs
will need to compare the contents of those nodes instead of just
pointer comparison. We use this code and attr_needed to avoid memory
consumption and setref's CPU consumption.

>
> +       /* Adjust the cost and width of child targetlist. */
> +       if (phv_added)
> +       {
> +               childrel->reltarget->cost.startup =
> parentrel->reltarget->cost.startup;
> +               childrel->reltarget->cost.per_tuple =
> parentrel->reltarget->cost.per_tuple;
> +               childrel->reltarget->width = parentrel->reltarget->width;
> +       }
>
> Making this conditional on phv_added is probably not saving anything.
> Branches are expensive.

Ok.

If there are not PHVs in the query i.e. when root->placeholders_list
is NIL, we don't need to scan reltarget->exprs. I have added that
optimization.

>
>                 /*
>                  * Otherwise, anything in a baserel or joinrel
> targetlist ought to be
> -                * a Var.  (More general cases can only appear in
> appendrel child
> -                * rels, which will never be seen here.)
> +                * a Var or ConvertRowtypeExpr. For either of those,
> find the original
> +                * baserel where they originate.
>                  */
>
> Hmm, but now we could potentially see an appendrel child rel here, so
> don't we need to worry about more general cases?  If not, let's
> explain why not.

By more general cases, that comment means ConvertRowtypeExpr or
RowExpr, nothing else. A base relation's tlist can have only Var nodes
when it reaches this comment. When a parent Var node is subjected to
adjust_appendrel_attrs() it is translated to a Var node for all
varattnos except 0, which indicates a whole-row var. For a child
table, a whole-row var is always a named row type and hence gets
translated to a ConvertRowExpr. Other kinds of children (subqueries in
union etc.) can not appear here since they do not participate in a
join directly. So it's really a Var and ConvertRowtypeExpr. I have
modified the comment to explain this.

>
> +                        * if, it's a ConvertRowtypeExpr, it will be
> computed only for the
>
> American usage does not put a comma after if like this (unless you are
> writing writing if, for example, blah blah blah -- but there the
> commas are to surround for example, not due to the if itself).

That comma was unintentional. Removed.

>
> +/*
> + * build_joinrel_partition_info
> + *             If the join between given partitioned relations is
> possibly partitioned
> + *             set the partitioning scheme and partition keys
> expressions for the
> + *             join.
> + *
> + * If the two relations have same partitioning scheme, their join may be
> + * partitioned and will follow the same partitioning scheme as the joining
> + * relations.
> + */
>
> I think you could drop the primary comment block and use the secondary
> block as the primary one.  That is, get rid of "If the join
> between..." and promote "If the two relations...".

Done.

>
> +        * The join is not partitioned, if any of the relations being joined are
>
> Another comma that's not typical of American usage.

Done.

>
> +        * For an N-way inner join, where every syntactic inner join
> has equi-join
>
> has -> has an
>
> +        * For an N-way join with outer joins, where every syntactic join has an
> +        * equi-join between partition keys and a matching partitioning scheme,
> +        * outer join reordering identities in optimizer/README imply that only
> +        * those pairs of join are legal which have an equi-join
> between partition
> +        * keys. Thus every pair of joining relations we see for this
> join should
> +        * have an equi-join between partition keys if this join has been deemed
> +        * as a partitioned join.
>
> In line 2, partition keys -> the partition keys
> In line 3, outer join -> the outer join
>
> "pairs of join" sounds wrong too, although I'm not sure how to reword it.
>
> More broadly: I don't think I understand this comment.  The statement
> about "those pairs of join are legal which have an equi-join between
> partition keys" doesn't match my understanding e.g. A IJ B ON A.x =
> B.x LJ C ON A.x = C.x surely allows a B-C join, but there's no such
> clause syntatically.
>
> Maybe you could replace this whole comment block with something like
> this: We can only consider this join as an input to further
> partition-wise joins if (a) the input relations are partitioned, (b)
> the partition schemes match, and (c) we can identify an equi-join
> between the partition keys.  Note that if it were possible for
> have_partkey_equi_join to return different answers for the same
> joinrel depending on which join ordering we try first, this logic
> would break. That shouldn't happen, though, because of the way the
> query planner deduces implied equalities.


Hmm. I meant the second para to be read in the context of the first.
Since AB is inner join A.x and B.x are replaceable (I forgot the
correct term, identity?) and thus A.x = C.x implies B.x = C.x thus
allowing join BC. But I think your version of the comment is easy to
understand. But I think it should also refer to the way planner
reorders joins; that's what causes us to worry about every join order
being partitioned. I think we should redirect a reader, who wants to
understand more about implied equalities and join orders, to
optimizer/README. So, I have changed the last sentence to read "That
shouldn't happen, though, because of the way the query planner deduces
implied equalities and reorders joins. See optimizer/README for
details." If you don't like my changes, please feel free to drop
those.

In the code block following this comment, I have used shorter variable
names instead of accurate but long ones. E.g. outer_expr should have
been outer_partexpr and outer_null_expr should have been
outer_nullable_partexpr. Please feel free to change those if you don't
like them or let me know if you have any better ideas and I will
update the patch with those ideas.

>
> +        * Join relation is partitioned using same partitioning scheme as the
> +        * joining relations and has same bounds.
>
> the same partitioning scheme

Done.

>
> +        * An INNER join between two partitioned relations is partitioned by key
> +        * expressions from both the relations. For tables A and B
> partitioned by
> +        * a and b respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by
> +        * both A.a and B.b.
> +        *
> +        * A SEMI/ANTI join only retains data from the outer side and is
> +        * partitioned by the partition keys of the outer side.
>
> I would write: An INNER join between two partitioned relations can be
> regarded as partitioned by either key expression.  For example, A
> INNER JOIN B ON A.a = B.b can be regarded as partitioned on A.a or on
> B.b; they are equivalent.  For a SEMI or ANTI join, the result can
> only be regarded as being partitioned in the same manner as the outer
> side, since the inner columns are not retained.

Done.

>
> +        * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
> +        * B.b NULL. These rows may not fit the partitioning
> conditions imposed on
> +        * B.b. Hence, strictly speaking, the join is not partitioned by B.b.
>
> Good.
>
> +        * Strictly speaking, partition keys of an OUTER join should include
> +        * partition key expressions from the OUTER side only. Consider a join
>
> I would join this with the previous sentence instead of repeating
> strictly speaking: ...and thus the partition keys should include
> partition key expressions from the OUTER side only.  After that
> sentence, I'd skip a lot of the intermediate words here and continue
> this way: However, because all commonly-used comparison operators are
> strict, the presence of nulls on the outer side doesn't cause any
> problem; they can't match anything at future join levels anyway.
> Therefore, we track two sets of expressions: those that authentically
> partition the relation (partexprs) and those that partition the
> relation with the exception that extra nulls may be present
> (nullable_partexprs).  When the comparison operator is strict, the
> latter is just as good as the former.
>
> Then, I think you can omit the rest of what you have; it should be
> clear enough what's going on for the full and right cases given that
> explanation.

I liked this version. Changed the comments as per your suggestions.

>
> + * being joined. partexprs and nullable_partexprs are arrays
> containing part_scheme->partnatts
>
> Long line, needs reflowing.

Done. Also fixed a grammatical mistake: contains -> contain in the
last line of that paragraph.

>
> I don't think this is too far from being committable.  You've done
> some nice work here!
>

Thanks a lot for your detailed reviews and guidance. I will post the
updated patchset with my next reply.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Etsuro Fujita
Дата:
Сообщение: Re: [HACKERS] Another oddity in handling of WCO constraints inpostgres_fdw
Следующее
От: Etsuro Fujita
Дата:
Сообщение: Re: [HACKERS] Comment typo