Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables

Поиск
Список
Период
Сортировка
От Ashutosh Bapat
Тема Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
Дата
Msg-id CAFjFpRfJ3GRRmmOugaMA-q4i=se5P6yjZ_C6A6HDRDQQTGXy1A@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables  (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>)
Ответы Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables  (Robert Haas <robertmhaas@gmail.com>)
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables  (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>)
Список pgsql-hackers
On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/09/09 2:38, Ashutosh Bapat wrote:
>> On Fri, Sep 8, 2017 at 11:17 AM, Amit Langote wrote:
>>> I updated the patch to include just those changes.  I'm not sure about
>>> one of the Ashutosh's changes whereby the child PlanRowMark is also passed
>>> to expand_partitioned_rtentry() to use as the parent PlanRowMark.  I think
>>> the child RTE, child RT index and child Relation are fine, because they
>>> are necessary for creating AppendRelInfos in a desired way for later
>>> planning steps.  But PlanRowMarks are not processed within the planner
>>> afterwards and do not need to be marked with the immediate parent-child
>>> association in the same way that AppendRelInfos need to be.
>>
>> Passing top parent's row mark works today, since there is no
>> parent-child specific translation happening there. But if in future,
>> we introduce such a translation, row marks for indirect children in a
>> multi-level partitioned hierarchy won't be accurate. So, I think it's
>> better to pass row marks of the direct parent.
>
> IMHO, we should make it the responsibility of the future patch to set a
> child PlanRowMark's prti to the direct parent's RT index, when we actually
> know that it's needed for something.  We clearly know today why we need to
> pass the other objects like child RT entry, RT index, and Relation, so we
> should limit this patch to pass only those objects to the recursive call.
> That makes this patch a relatively easy to understand change.

I think you are mixing two issues here 1. setting parent RTI in child
PlanRowMark and 2. passing immediate parent's PlanRowMark to
expand_single_inheritance_child().

I have discussed 1 in my reply to Robert.

About 2 you haven't given any particular comments to my reply. To me
it looks like it's this patch that introduces the notion of
multi-level expansion, so it's natural for this patch to pass
PlanRowMark in cascaded fashion similar to other structures.

>
>>> I also included the changes to add_paths_to_append_rel() from my patch on
>>> the "path toward faster partition pruning" thread.  We'd need that change,
>>> because while add_paths_to_append_rel() is called for all partitioned
>>> table RTEs in a given partition tree, expand_inherited_rtentry() would
>>> have set up a PartitionedChildRelInfo only for the root parent, so
>>> get_partitioned_child_rels() would not find the same for non-root
>>> partitioned table rels and crash failing the Assert.  The changes I made
>>> are such that we call get_partitioned_child_rels() only for the parent
>>> rels that are known to correspond root partitioned tables (or as you
>>> pointed out on the thread, "the table named in the query" as opposed those
>>> added to the query as result of inheritance expansion).  In addition to
>>> the relkind check on the input RTE, it would seem that checking that the
>>> reloptkind is RELOPT_BASEREL would be enough.  But actually not, because
>>> if a partitioned table is accessed in a UNION ALL query, reloptkind even
>>> for the root partitioned table (the table named in the query) would be
>>> RELOPT_OTHER_MEMBER_REL.  The only way to confirm that the input rel is
>>> actually the root partitioned table is to check whether its parent rel is
>>> not RTE_RELATION, because the parent in case of UNION ALL Append is a
>>> RTE_SUBQUERY RT entry.
>>>
>>
>> There was a change in my 0003 patch, which fixed the crash. It checked
>> for RELOPT_BASEREL and RELKIND_PARTITIONED_TABLE. I have pulled it in
>> my 0001 patch. It no more crashes. I tried various queries involving
>> set operations and bare multi-level partitioned table scan with my
>> patch, but none of them showed any anomaly. Do you have a testcase
>> which shows problem with my patch? May be your suggestion is correct,
>> but corresponding code implementation is slightly longer than I would
>> expect. So, we should go with it, if there is corresponding testcase
>> which shows why it's needed.
>
> If we go with your patch, partitioned tables won't get locked, for
> example, in case of the following query (p is a partitioned table):
>
> select 1 from p union all select 2 from p;
>
> That's because the RelOptInfos for the two instances of p in the above
> query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL.  They are children
> of the Append corresponding to the UNION ALL subquery RTE.  So,
> partitioned_rels does not get set per your proposed code.

Session 1:
postgres=# begin;
BEGIN
postgres=# select 1 from t1 union all select 2 from t1;?column?
----------
(0 rows)

postgres=# select pg_backend_pid();pg_backend_pid
----------------         28843
(1 row)


Session 2
postgres=# select locktype, relation::regclass, virtualxid,
virtualtransaction, pid, mode, granted, fastpath from pg_locks; locktype  | relation | virtualxid | virtualtransaction
| pid  |mode       | granted | fastpath
 
------------+----------+------------+--------------------+-------+-----------------+---------+----------relation   |
pg_locks|            | 4/14               | 28854 |
 
AccessShareLock | t       | tvirtualxid |          | 4/14       | 4/14               | 28854 |
ExclusiveLock   | t       | trelation   | t1p1p1   |            | 3/9                | 28843 |
AccessShareLock | t       | trelation   | t1p1     |            | 3/9                | 28843 |
AccessShareLock | t       | trelation   | t1       |            | 3/9                | 28843 |
AccessShareLock | t       | tvirtualxid |          | 3/9        | 3/9                | 28843 |
ExclusiveLock   | t       | t
(6 rows)

So, all partitioned partitions are getting locked correctly. Am I
missing something?

>
>>
>> In your patch
>>
>> +            parent_rel = root->simple_rel_array[parent_relid];
>> +            get_pcinfo = (parent_rel->rtekind == RTE_SUBQUERY);
>>
>> Do you mean RTE_RELATION as you explained above?
>
> No, I mean RTE_SUBQUERY.
>
> If the partitioned table RTE in question corresponds to one named in the
> query, we should be able to find its pcinfo in root->pcinfo_list.  If the
> partitioned table RTE is one added as result of inheritance expansion, it
> won't have an associated pcinfo.  So, we should find a way to distinguish
> them from one another.  The first idea that had occurred to me was the
> same as yours -- RelOptInfo of the partitioned table RTE named in the
> query would be of reloptkind RELOPT_BASEREL and those of the partitioned
> table RTE added as result of inheritance expansion will be of reloptkind
> RELOPT_OTHER_MEMBER_REL.  Although the latter is always true, the former
> is not.  If the partitioned table named in the query appears under UNION
> ALL query, then its reloptkind will be RELOPT_OTHER_MEMBER_REL.  That
> means we have to use some other means to distinguish partitioned table
> RTEs that have an associated pcinfo from those that don't.  So, I devised
> this method of looking at the parent RTE (if any) for distinguishing the
> two.  Partitioned table named in the query either doesn't have the parent
> or if it does, the parent could only ever be a UNION ALL subquery
> (RTE_SUBQUERY).  Partitioned tables added as part of inheritance expansion
> will always have the parent and the parent will only ever be a table
> (RTE_RELATION).
>

Actually, the original problem that caused this discussion started
with an assertion failure in get_partitioned_child_rels() as
Assert(list_length(result) >= 1);

This assertion fails if result is NIL when an intermediate partitioned
table is passed. May be we should assert (result == NIL ||
list_length(result) == 1) and allow that function to be called even
for intermediate partitioned partitions for which the function will
return NIL. That will leave the code in add_paths_to_append_rel()
simple. Thoughts?

>
> In create_lateral_join_info():
>
> +        Assert(IS_SIMPLE_REL(brel));
> +        Assert(brte);
>
> The second Assert is either unnecessary or should be placed first.

simple_rte_array[] may have some NULL entries. Second assert makes
sure that we aren't dealing with a NULL entry. Any particular reason
to reorder the asserts?

>
> The following comment could be made a bit clearer.
>
> +         * In the case of table inheritance, the parent RTE is directly
> linked
> +         * to every child table via an AppendRelInfo.  In the case of table
> +         * partitioning, the inheritance hierarchy is expanded one level at a
> +         * time rather than flattened.  Therefore, an other member rel
> that is
> +         * a partitioned table may have children of its own, and must
> +         * therefore be marked with the appropriate lateral info so that
> those
> +         * children eventually get marked also.
>
> How about: In the case of partitioned table inheritance, the original
> parent RTE is linked, via AppendRelInfo, only to its immediate partitions.
>  Partitions below the first level are accessible only via their immediate
> parent's RelOptInfo, which would be of kind RELOPT_OTHER_MEMBER_REL, so
> consider those as well.

I don't see much difference between those two. We usually do not use
macros in comments, so usually comments mention "other member" rel.
Let's leave this for the committer to judge.

>
> In expand_inherited_rtentry(), the following comment fragment is obsolete,
> because we *do* now create AppendRelInfo's for partitioned children:
>
> +        /*
> +         * We keep a list of objects in root, each of which maps a
> partitioned
> +         * parent RT index to the list of RT indexes of its partitioned child
> +         * tables which do not have AppendRelInfos associated with those.

Good catch. I have reworded it as       /*        * We keep a list of objects in root, each of which maps a root
*partitioned parent RT index to the list of RT indexes of descendant        * partitioned child tables.
 

Does that look good?

>
>
> By the way, when we call expand_single_inheritance_child() in the
> non-partitioned inheritance case, we should pass NULL for childrte_p,
> childRTindex_p, childrc_p, instead of declaring variables that won't be
> used.  Hence, expand_single_inheritance_child() should make those
> arguments optional.

That introduces an extra "if" condition, which is costlier than an
assignment. We have used both the styles in the code. Previously, I
have got comments otherwise. So, I am not sure.

>
> +
> +    /*
> +     * If the partitioned table has no partitions or all the partitions are
> +     * temporary tables from other backends, treat this as non-inheritance
> +     * case.
> +     */
> +    if (!has_child)
> +        parentrte->inh = false;
>
> I guess the above applies to all partitioned tables in the tree, so, I
> think we should update the comment in set_rel_size():
>
>                 else if (rte->relkind == RELKIND_PARTITIONED_TABLE)
>                 {
>                     /*
>                      * A partitioned table without leaf partitions is marked
>                      * as a dummy rel.
>                      */
>                     set_dummy_rel_pathlist(rel);
>                 }
>
> to say: a partitioned table without partitions is marked as a dummy rel.

Done. Thanks again for the catch.

I will update the patches once we have some resolution about 1. prti
in PlanRowMarks and 2. detection of root partitioned table in
add_paths_to_append_rel().

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tomas Vondra
Дата:
Сообщение: Re: [HACKERS] Polyphase merge is obsolete
Следующее
От: Ashutosh Bapat
Дата:
Сообщение: Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables