Обсуждение: [HACKERS] path toward faster partition pruning

Поиск

Список

Период

Сортировка

[HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

21 августа 2017 г., 12:37:18

I've been working on implementing a way to perform plan-time
partition-pruning that is hopefully faster than the current method of
using constraint exclusion to prune each of the potentially many
partitions one-by-one. It's not fully cooked yet though.

Meanwhile, I thought I'd share a couple of patches that implement some
restructuring of the planner code related to partitioned table inheritance
planning that I think would be helpful. They are to be applied on top of
the patches being discussed at [1]. Note that these patches themselves
don't implement the actual code that replaces constraint exclusion as a
method of performing partition pruning. I will share that patch after
debugging it some more.

The main design goal of the patches I'm sharing here now is to defer the
locking and opening of leaf partitions in a given partition tree to a
point after set_append_rel_size() is called on the root partitioned table.
Currently, AFAICS, we need to lock and open the child tables in
expand_inherited_rtentry() only to set the translated_vars field in
AppendRelInfo that we create for the child. ISTM, we can defer the
creation of a child AppendRelInfo to a point when it (its field
translated_vars in particular) will actually be used and so lock and open
the child tables only at such a time. Although we don't lock and open the
partition child tables in expand_inherited_rtentry(), their RT entries are
still created and added to root->parse->rtable, so that
setup_simple_rel_arrays() knows the maximum number of entries
root->simple_rel_array will need to hold and allocate the memory for that
array accordingly. Slots in simple_rel_array[] corresponding to
partition child tables will be empty until they are created when
set_append_rel_size() is called on the root parent table and it determines
the partitions that will be scanned after all.

Patch augments the existing PartitionedChildRelInfo node, which currently
holds only the partitioned child rel RT indexes, to carry some more
information about the partition tree, which includes the information
returned by RelationGetPartitionDispatchInfo() when it is called from
expand_inherited_rtentry() (per the proposed patch in [1], we call it to
be able to add partitions to the query tree in the bound order).
Actually, since PartitionedChildRelInfo now contains more information
about the partition tree than it used to before, I thought the struct's
name is no longer relevant, so renamed it to PartitionRootInfo and renamed
root->pcinfo_list accordingly to prinfo_list. That seems okay because we
only use that node internally.

Then during the add_base_rels_to_query() step, when build_simple_rel()
builds a RelOptInfo for the root partitioned table, it also initializes
some newly introduced fields in RelOptInfo from the information contained
in PartitionRootInfo of the table. The aforementioned fields are only
initialized in RelOptInfos of root partitioned tables. Note that the
add_base_rels_to_query() step won't add the partition "otherrel"
RelOptInfos yet (unlike the regular inheritance case, where they are,
after looking them up in root->append_rel_list).

When set_append_rel_size() is called on the root partitioned table, it
will call a find_partitions_for_query(), which using the partition tree
information, determines the partitions that will need to be scanned for
the query. This processing happens recursively, that is, we first
determine the root-parent's partitions and then for each partition that's
partitioned, we will determine its partitions and so on. As we determine
partitions in this per-partitioned-table manner, we maintain a pair
(parent_relid, list-of-partition-relids-to-scan) for each partitioned
table and also a single list of all leaf partitions determined so far.
Once all partitions have been determined, we turn to locking the leaf
partitions. The locking happens in the order of OIDs as
find_all_inheritors would have returned in expand_inherited_rtentry(); the
list of OIDs in that original order is also stored in the table's
PartitionRootInfo node. For each OID in that list, check if that OID is
in the set of leaf partition OIDs that was just computed, and if so, lock
it. For all chosen partitions that are partitioned tables (including the
root), we create a PartitionAppendInfo node which stores the
aforementioned pair (parent_relid, list-of-partitions-relids-to-scan), and
append it to a list in the root table's RelOptInfo, with the root table's
PartitionAppendInfo at the head of the list. Note that the list of
partitions in this pair contains only the immediate partitions, so that
the original parent-child relationship is reflected in the list of
PartitionAppendInfos thus collected. The next patch that will implement
actual partition-pruning will add some more code that will run under
find_partitions_for_query().

set_append_rel_size() processing then continues for the root partitioned
table. It is at this point that we will create the RelOptInfos and
AppendRelInfos for partitions. First for those of the root partitioned
table and then for those of each partitioned table when
set_append_rel_size() will be recursively called for the latter.

Note that this is still largely a WIP patch and the implementation details
might change per both the feedback here and the discussion at [1].

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/befd7ec9-8f4c-6928-d330-ab05dbf860bf%40lab.ntt.co.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] path toward faster partition pruning

От

Ashutosh Bapat

Дата:

23 августа 2017 г., 17:16:10

On Mon, Aug 21, 2017 at 12:07 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I've been working on implementing a way to perform plan-time
> partition-pruning that is hopefully faster than the current method of
> using constraint exclusion to prune each of the potentially many
> partitions one-by-one.  It's not fully cooked yet though.
>
> Meanwhile, I thought I'd share a couple of patches that implement some
> restructuring of the planner code related to partitioned table inheritance
> planning that I think would be helpful.  They are to be applied on top of
> the patches being discussed at [1].  Note that these patches themselves
> don't implement the actual code that replaces constraint exclusion as a
> method of performing partition pruning.  I will share that patch after
> debugging it some more.
>
> The main design goal of the patches I'm sharing here now is to defer the
> locking and  opening of leaf partitions in a given partition tree to a
> point after set_append_rel_size() is called on the root partitioned table.
>  Currently, AFAICS, we need to lock and open the child tables in
> expand_inherited_rtentry() only to set the translated_vars field in
> AppendRelInfo that we create for the child.  ISTM, we can defer the
> creation of a child AppendRelInfo to a point when it (its field
> translated_vars in particular) will actually be used and so lock and open
> the child tables only at such a time.  Although we don't lock and open the
> partition child tables in expand_inherited_rtentry(), their RT entries are
> still created and added to root->parse->rtable, so that
> setup_simple_rel_arrays() knows the maximum number of entries
> root->simple_rel_array will need to hold and allocate the memory for that
> array accordingly.   Slots in simple_rel_array[] corresponding to
> partition child tables will be empty until they are created when
> set_append_rel_size() is called on the root parent table and it determines
> the partitions that will be scanned after all.

The partition pruning can happen only after the quals have been
distributed to Rels i.e. after deconstruct_jointree(),
reconsider_outer_join_clauses() and generate_base_implied_equalities()
have been called. If the goal is to not heap_open() the partitions
which are pruned, we can't do that in expand_inherited_rtentry(). One
reason why I think we don't want to heap_open() partition relations is
to avoid relcache bloat because of opened partition relations, which
are ultimately pruned. But please note that according to your patches,
we still need to populate catalog caches to get relkind and reltype
etc.

There are many functions that traverse simple_rel_array[] after it's
created. Most of them assume that the empty entries in that array
correspond to non-simple range entries like join RTEs. But now we are
breaking that assumption. Most of these functions also skip "other"
relations, so that may be OK now, but I am not sure if it's really
going to be fine if we keep empty slots in place of partition
relations. There may be three options here 1. add placeholder
RelOptInfos for partition relations (may be mark those specially) and
mark the ones which get pruned as dummy later. 2. Prune the partitions
before any functions scans simple_rel_array[] or postpone creating
simple_rel_array till pruning. 3. Examine all the current scanners
esp. the ones which will be called before pruning to make sure that
skipping "other" relations is going to be kosher.

>
> Patch augments the existing PartitionedChildRelInfo node, which currently
> holds only the partitioned child rel RT indexes, to carry some more
> information about the partition tree, which includes the information
> returned by RelationGetPartitionDispatchInfo() when it is called from
> expand_inherited_rtentry() (per the proposed patch in [1], we call it to
> be able to add partitions to the query tree in the bound order).
> Actually, since PartitionedChildRelInfo now contains more information
> about the partition tree than it used to before, I thought the struct's
> name is no longer relevant, so renamed it to PartitionRootInfo and renamed
> root->pcinfo_list accordingly to prinfo_list.  That seems okay because we
> only use that node internally.
>
> Then during the add_base_rels_to_query() step, when build_simple_rel()
> builds a RelOptInfo for the root partitioned table, it also initializes
> some newly introduced fields in RelOptInfo from the information contained
> in PartitionRootInfo of the table.  The aforementioned fields are only
> initialized in RelOptInfos of root partitioned tables.  Note that the
> add_base_rels_to_query() step won't add the partition "otherrel"
> RelOptInfos yet (unlike the regular inheritance case, where they are,
> after looking them up in root->append_rel_list).

Partition-wise join requires the partition hierarchy to be expanded
level-by-level keeping in-tact the parent-child relationship between
partitioned table and its partitions. Your patch doesn't do that and
adds all the partitioning information in the root partitioned table's
RelOptInfo. OTOH, partition-wise join patch adds partition bounds, key
expressions, OID and RelOptInfos of the immediate partitions
(including partitioned partitions) to RelOptInfo of a partitioned
table (see patch 0002 in the latest set of patches at [1]). I don't
see much point in having conflicting changes in both of our patches.
May be you should review that patch from my set  and we can find a set
of members which help both partition pruning and partition-wise join.

>
> When set_append_rel_size() is called on the root partitioned table, it
> will call a find_partitions_for_query(), which using the partition tree
> information, determines the partitions that will need to be scanned for
> the query.  This processing happens recursively, that is, we first
> determine the root-parent's partitions and then for each partition that's
> partitioned, we will determine its partitions and so on.  As we determine
> partitions in this per-partitioned-table manner, we maintain a pair
> (parent_relid, list-of-partition-relids-to-scan) for each partitioned
> table and also a single list of all leaf partitions determined so far.
> Once all partitions have been determined, we turn to locking the leaf
> partitions.  The locking happens in the order of OIDs as
> find_all_inheritors would have returned in expand_inherited_rtentry(); the
> list of OIDs in that original order is also stored in the table's
> PartitionRootInfo node.  For each OID in that list, check if that OID is
> in the set of leaf partition OIDs that was just computed, and if so, lock
> it.  For all chosen partitions that are partitioned tables (including the
> root), we create a PartitionAppendInfo node which stores the
> aforementioned pair (parent_relid, list-of-partitions-relids-to-scan), and
> append it to a list in the root table's RelOptInfo, with the root table's
> PartitionAppendInfo at the head of the list.  Note that the list of
> partitions in this pair contains only the immediate partitions, so that
> the original parent-child relationship is reflected in the list of
> PartitionAppendInfos thus collected.  The next patch that will implement
> actual partition-pruning will add some more code that will run under
> find_partitions_for_query().
>
> set_append_rel_size() processing then continues for the root partitioned
> table.  It is at this point that we will create the RelOptInfos and
> AppendRelInfos for partitions.  First for those of the root partitioned
> table and then for those of each partitioned table when
> set_append_rel_size() will be recursively called for the latter.

set_append_rel_size(), set_append_rel_pathlist() are already
recursive, so if we process expansion and pruning for one level in
those functions, the recursion will automatically take care of doing
so for every level.

>
>
> Note that this is still largely a WIP patch and the implementation details
> might change per both the feedback here and the discussion at [1].

The changes to code which handle expansion in this patch set should
really be part of expansion in bound order thread so that it's easy to
review all changes together. And this thread can then only concentrate
on partition pruning.

[1] http://postgr.es/m/CAFjFpRd9Vqh_=-Ldv-XqWY006d07TJ+VXuhXCbdj=P1jukYBrw@mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

31 августа 2017 г., 09:17:23

Hi Ashutosh,

Thanks for the comments and sorry that it took me a while to reply here.

On 2017/08/23 20:16, Ashutosh Bapat wrote:
> On Mon, Aug 21, 2017 at 12:07 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I've been working on implementing a way to perform plan-time
>> partition-pruning that is hopefully faster than the current method of
>> using constraint exclusion to prune each of the potentially many
>> partitions one-by-one.  It's not fully cooked yet though.
>>
>> Meanwhile, I thought I'd share a couple of patches that implement some
>> restructuring of the planner code related to partitioned table inheritance
>> planning that I think would be helpful.  They are to be applied on top of
>> the patches being discussed at [1].  Note that these patches themselves
>> don't implement the actual code that replaces constraint exclusion as a
>> method of performing partition pruning.  I will share that patch after
>> debugging it some more.
>>
>> The main design goal of the patches I'm sharing here now is to defer the
>> locking and  opening of leaf partitions in a given partition tree to a
>> point after set_append_rel_size() is called on the root partitioned table.
>>  Currently, AFAICS, we need to lock and open the child tables in
>> expand_inherited_rtentry() only to set the translated_vars field in
>> AppendRelInfo that we create for the child.  ISTM, we can defer the
>> creation of a child AppendRelInfo to a point when it (its field
>> translated_vars in particular) will actually be used and so lock and open
>> the child tables only at such a time.  Although we don't lock and open the
>> partition child tables in expand_inherited_rtentry(), their RT entries are
>> still created and added to root->parse->rtable, so that
>> setup_simple_rel_arrays() knows the maximum number of entries
>> root->simple_rel_array will need to hold and allocate the memory for that
>> array accordingly.   Slots in simple_rel_array[] corresponding to
>> partition child tables will be empty until they are created when
>> set_append_rel_size() is called on the root parent table and it determines
>> the partitions that will be scanned after all.
> 
> The partition pruning can happen only after the quals have been
> distributed to Rels i.e. after deconstruct_jointree(),
> reconsider_outer_join_clauses() and generate_base_implied_equalities()
> have been called. If the goal is to not heap_open() the partitions
> which are pruned, we can't do that in expand_inherited_rtentry(). One
> reason why I think we don't want to heap_open() partition relations is
> to avoid relcache bloat because of opened partition relations, which
> are ultimately pruned. But please note that according to your patches,
> we still need to populate catalog caches to get relkind and reltype
> etc.

Yes, we still hit syscache for *all* partitions.  I haven't yet thought
very hard about avoiding that altogether.

> There are many functions that traverse simple_rel_array[] after it's
> created. Most of them assume that the empty entries in that array
> correspond to non-simple range entries like join RTEs. But now we are
> breaking that assumption. Most of these functions also skip "other"
> relations, so that may be OK now, but I am not sure if it's really
> going to be fine if we keep empty slots in place of partition
> relations. There may be three options here 1. add placeholder
> RelOptInfos for partition relations (may be mark those specially) and
> mark the ones which get pruned as dummy later. 2. Prune the partitions
> before any functions scans simple_rel_array[] or postpone creating
> simple_rel_array till pruning. 3. Examine all the current scanners
> esp. the ones which will be called before pruning to make sure that
> skipping "other" relations is going to be kosher.

Between the point when slots in simple_rel_array are allocated
(setup_simple_rel_arrays) and partition RelOptInfos are actually created
after the partition-pruning step has occurred (set_append_rel_size), it
seems that most places that iterate over simple_rel_array know also to
skip slots containing NULL values.  We might need to document that NULL
means partitions in addition to its current meaning - non-baserels.

>> Patch augments the existing PartitionedChildRelInfo node, which currently
>> holds only the partitioned child rel RT indexes, to carry some more
>> information about the partition tree, which includes the information
>> returned by RelationGetPartitionDispatchInfo() when it is called from
>> expand_inherited_rtentry() (per the proposed patch in [1], we call it to
>> be able to add partitions to the query tree in the bound order).
>> Actually, since PartitionedChildRelInfo now contains more information
>> about the partition tree than it used to before, I thought the struct's
>> name is no longer relevant, so renamed it to PartitionRootInfo and renamed
>> root->pcinfo_list accordingly to prinfo_list.  That seems okay because we
>> only use that node internally.
>>
>> Then during the add_base_rels_to_query() step, when build_simple_rel()
>> builds a RelOptInfo for the root partitioned table, it also initializes
>> some newly introduced fields in RelOptInfo from the information contained
>> in PartitionRootInfo of the table.  The aforementioned fields are only
>> initialized in RelOptInfos of root partitioned tables.  Note that the
>> add_base_rels_to_query() step won't add the partition "otherrel"
>> RelOptInfos yet (unlike the regular inheritance case, where they are,
>> after looking them up in root->append_rel_list).
> 
> Partition-wise join requires the partition hierarchy to be expanded
> level-by-level keeping in-tact the parent-child relationship between
> partitioned table and its partitions. Your patch doesn't do that and
> adds all the partitioning information in the root partitioned table's
> RelOptInfo. OTOH, partition-wise join patch adds partition bounds, key
> expressions, OID and RelOptInfos of the immediate partitions
> (including partitioned partitions) to RelOptInfo of a partitioned
> table (see patch 0002 in the latest set of patches at [1]). I don't
> see much point in having conflicting changes in both of our patches.
> May be you should review that patch from my set  and we can find a set
> of members which help both partition pruning and partition-wise join.

Yes, I think it would be a good idea for the partition-pruning patch to
initialize those fields in the individual parents' RelOptInfos.  I will
review relevant patches in the partitionwise-join thread to see what can
be incorporated here.

>> When set_append_rel_size() is called on the root partitioned table, it
>> will call a find_partitions_for_query(), which using the partition tree
>> information, determines the partitions that will need to be scanned for
>> the query.  This processing happens recursively, that is, we first
>> determine the root-parent's partitions and then for each partition that's
>> partitioned, we will determine its partitions and so on.  As we determine
>> partitions in this per-partitioned-table manner, we maintain a pair
>> (parent_relid, list-of-partition-relids-to-scan) for each partitioned
>> table and also a single list of all leaf partitions determined so far.
>> Once all partitions have been determined, we turn to locking the leaf
>> partitions.  The locking happens in the order of OIDs as
>> find_all_inheritors would have returned in expand_inherited_rtentry(); the
>> list of OIDs in that original order is also stored in the table's
>> PartitionRootInfo node.  For each OID in that list, check if that OID is
>> in the set of leaf partition OIDs that was just computed, and if so, lock
>> it.  For all chosen partitions that are partitioned tables (including the
>> root), we create a PartitionAppendInfo node which stores the
>> aforementioned pair (parent_relid, list-of-partitions-relids-to-scan), and
>> append it to a list in the root table's RelOptInfo, with the root table's
>> PartitionAppendInfo at the head of the list.  Note that the list of
>> partitions in this pair contains only the immediate partitions, so that
>> the original parent-child relationship is reflected in the list of
>> PartitionAppendInfos thus collected.  The next patch that will implement
>> actual partition-pruning will add some more code that will run under
>> find_partitions_for_query().
>>
>> set_append_rel_size() processing then continues for the root partitioned
>> table.  It is at this point that we will create the RelOptInfos and
>> AppendRelInfos for partitions.  First for those of the root partitioned
>> table and then for those of each partitioned table when
>> set_append_rel_size() will be recursively called for the latter.
> 
> set_append_rel_size(), set_append_rel_pathlist() are already
> recursive, so if we process expansion and pruning for one level in
> those functions, the recursion will automatically take care of doing
> so for every level.

My only worry about that is locking order of leaf partitions will be
different from concurrent backends if we lock them in an order dictated by
traversing the partition tree level at a time.  Because such traversal
will presumably proceed in the partition bound order.

The patch computes *all* leaf partitions that will need to be scanned by
the query (after pruning needless ones) and lock the chosen partitions in
the original order (it keeps the original order OID list generated by
find_all_inheritors around for this purpose).  While computing the leaf
partitions, it remembers immediate parent-child relationships in the
process.  The way it does it is by processing the partition tree in a
recursive depth-first manner, and in each recursive step, creating a
PartitionAppendInfo that maps a parent table to its immediate children
(only those that will satisfy the query).  Once the PartitionAppendInfo's
for all the parents in the partition tree have been computed, we resume
the set_append_rel_size() processing, which takes the root parent's
PartitionAppendInfo and builds RelOptInfos and AppendRelInfos for its
immediate children.  Its children that are partitioned tables themselves
will recursively call set_append_rel_size() and look up its own
PartitionAppendInfo and create RelOptInfos and AppendRelInfos for its
children and so on.

>> Note that this is still largely a WIP patch and the implementation details
>> might change per both the feedback here and the discussion at [1].
> 
> The changes to code which handle expansion in this patch set should
> really be part of expansion in bound order thread so that it's easy to
> review all changes together. And this thread can then only concentrate
> on partition pruning.

I think I agree.  I'm posting today the patches that actually implement
partition-pruning.  The previous patches do seem to belong on the EIBO
thread, but will post them together here today for convenience of being
able to apply them to HEAD and try out.  Now that Robert has posted a
patch to implement depth-first EIBO, I will have to find a way to rebase
the actual partition-pruning patches on this thread so that its core logic
works at all by finding the information it needs.

Thanks,
Amit

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

31 августа 2017 г., 12:02:48

On 2017/08/21 15:37, Amit Langote wrote:
> Meanwhile, I thought I'd share a couple of patches that implement some
> restructuring of the planner code related to partitioned table inheritance
> planning that I think would be helpful.  They are to be applied on top of
> the patches being discussed at [1].  Note that these patches themselves
> don't implement the actual code that replaces constraint exclusion as a
> method of performing partition pruning.  I will share that patch after
> debugging it some more.
>
> The next patch that will implement
> actual partition-pruning will add some more code that will run under
> find_partitions_for_query().

Attached is now also the set of patches that implement the actual
partition-pruning logic, viz. the last 3 patches (0004, 0005, and 0006) of
the attached.

Because the patch helps avoid performing constraint exclusion on *all*
partitions for a given query, one might expect this to improve performance
for queries on partitioned tables and scale to a fairly large number of
partitions.  Here are some numbers for the partitioned table and the query
shown below:

\d+ ptab
Columns: (a date, b int, c text)
Partition key: RANGE (a, b)
Partitions:
ptab_00001 FOR VALUES FROM ('2017-08-31', 1) TO ('2017-08-31', 1000),
ptab_00002 FOR VALUES FROM ('2017-08-31', 1000) TO ('2017-08-31', 2000),
ptab_00003 FOR VALUES FROM ('2017-08-31', 2000) TO ('2017-08-31', 3000),
ptab_00004 FOR VALUES FROM ('2017-08-31', 3000) TO ('2017-08-31', 4000),
ptab_00005 FOR VALUES FROM ('2017-08-31', 4000) TO ('2017-08-31', 5000),

ptab_00006 FOR VALUES FROM ('2017-09-01', 1) TO ('2017-09-01', 1000),
ptab_00007 FOR VALUES FROM ('2017-09-01', 1000) TO ('2017-09-01', 2000),

...
ptab_NNNNN FOR VALUES FROM (..., 4000) TO (..., 5000),

A query that prunes all partitions (empty result!):

explain select * from ptab where a < '2017-08-31';

Comparison of the average response times (in milliseconds) after running
the same query 100 times using pgbench against the database:

  #: Number of partitions of ptab
c_e: Constraint exclusion
f_p: Fast pruning

    #      c_e      f_p
=====    =====     ====
   10      0.7      0.4
   50      1.8      0.6
  100      3.2      0.8
  500     16.8      2.7
 1000     36.2      5.0
 2000     79.7     10.2
 5000    214.7     27.0
10000    443.6     64.8

For each partitioned table in a given partition tree (provided it is not
pruned to begin with), the query's clauses are matched to its partition
key and from the matched clauses, a pair of bounding keys (Datum tuples
with <= key->partnatts values for possibly a prefix of a multi-column key)
is generated.  They are passed to partition.c: get_partitions_for_keys()
as Datum *minkeys and Datum *maxkeys.  A list of partitions covering that
key range is returned.  When generating that list, whether a particular
scan key is inclusive or not is considered along with the partitioning
strategy.  It should be possible to support hash partitioning with
(hopefully) minimal changes to get_partitions_for_keys().

There are still certain limitations on the planner side of things:

1. OR clauses are not counted as contributing toward bounding scan keys;
   currently only OpExprs and NullTests are recognized, so an OR clause
   would get skipped from consideration when generating the bounding keys
   to pass to partition.c

2. Redundant clauses are not appropriately pre-processed; so if a query
   contains a = 10 and a > 1, the latter clause will be matched and
   partitions holding values with a > 1 and a < 10 will not be pruned,
   even if none of their rows will pass the query's condition

Fixing these limitations, adding more regression tests and implementing
some of the things suggested by Ashutosh Bapat [1] to prevent conflicting
changes with some preparatory patches in the partitionwise-join patch
series [2] are TODOs.

Adding this to CF-201709 as "faster partition pruning in planner".

To try out the attached patches: apply the patches posted at [3] on HEAD
and then apply these

Thanks,
Amit

[1]
https://postgr.es/m/CAFjFpRdb_fkmJHFjvAbB%2BLn0t45fWjekLd5pY%3Dsv%2BeAhBAKXPQ%40mail.gmail.com

[2]
https://postgr.es/m/CAFjFpRd9Vqh_=-Ldv-XqWY006d07TJ+VXuhXCbdj=P1jukYBrw@mail.gmail.com

[3]
https://www.postgresql.org/message-id/2124e99f-9528-0f71-4e10-ac7974dd7077%40lab.ntt.co.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] path toward faster partition pruning

От

Robert Haas

Дата:

01 сентября 2017 г., 23:52:48

On Thu, Aug 31, 2017 at 2:02 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached is now also the set of patches that implement the actual
> partition-pruning logic, viz. the last 3 patches (0004, 0005, and 0006) of
> the attached.

It strikes me that this patch set is doing two things but maybe in the
opposite order that I would have chosen to attack them.  First,
there's getting partition pruning to use something other than
constraint exclusion.  Second, there's deferring work that is
currently done at an early stage of the process until later, so that
we waste less effort on partitions that are ultimately going to be
pruned.

The second one is certainly a worthwhile goal, but there are fairly
firm interdependencies between the first one and some other things
that are in progress.  For example, the first one probably ought to be
done before hash partitioning gets committed, because
constraint-exclusion based partitioning pruning won't work with
partitioning pruning, but some mechanism based on asking the
partitioning code which partitions might match will.  Such a mechanism
is more efficient for list and range partitions, but it's the only
thing that will work for hash partitions.  Also, Beena Emerson is
working on run-time partition pruning, and the more I think about it,
the more I think that overlaps with this first part.  Both patches
need a mechanism to identify, given a btree-indexable comparison
operator (< > <= >= =) and a set of values, which partitions might
contain matching values.  Run-time partition pruning will call that at
execution time, and this patch will call it at plan time, but it's the
same logic; it's just a question of the point at which the values are
known.  And of course we don't want to end up with two copies of the
logic.

Therefore, IMHO, it would be best to focus first on how we're going to
identify the partitions that survive pruning, and then afterwards work
on transposing that logic to happen before partitions are opened and
locked.  That way, we get some incremental benefit sooner, and also
unblock some other development work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

04 сентября 2017 г., 07:10:41

Thanks for the comments.

On 2017/09/02 2:52, Robert Haas wrote:
> On Thu, Aug 31, 2017 at 2:02 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached is now also the set of patches that implement the actual
>> partition-pruning logic, viz. the last 3 patches (0004, 0005, and 0006) of
>> the attached.
> 
> It strikes me that this patch set is doing two things but maybe in the
> opposite order that I would have chosen to attack them.  First,
> there's getting partition pruning to use something other than
> constraint exclusion.  Second, there's deferring work that is
> currently done at an early stage of the process until later, so that
> we waste less effort on partitions that are ultimately going to be
> pruned.

OK.

> 
> The second one is certainly a worthwhile goal, but there are fairly
> firm interdependencies between the first one and some other things
> that are in progress.  For example, the first one probably ought to be
> done before hash partitioning gets committed, because
> constraint-exclusion based partitioning pruning won't work with
> partitioning pruning, but some mechanism based on asking the
> partitioning code which partitions might match will.

Yeah.

> Such a mechanism
> is more efficient for list and range partitions, but it's the only
> thing that will work for hash partitions.  Also, Beena Emerson is
> working on run-time partition pruning, and the more I think about it,
> the more I think that overlaps with this first part.  Both patches
> need a mechanism to identify, given a btree-indexable comparison
> operator (< > <= >= =) and a set of values, which partitions might
> contain matching values.  Run-time partition pruning will call that at
> execution time, and this patch will call it at plan time, but it's the
> same logic; it's just a question of the point at which the values are
> known.  And of course we don't want to end up with two copies of the
> logic.

Agreed here too.

I agree that spending effort on the first part (deferment of locking, etc.
within the planner) does not benefit either the hash partitioning and
run-time pruning patches much.

> Therefore, IMHO, it would be best to focus first on how we're going to
> identify the partitions that survive pruning, and then afterwards work
> on transposing that logic to happen before partitions are opened and
> locked.  That way, we get some incremental benefit sooner, and also
> unblock some other development work.

Alright, I will try to do it that way.

Thanks,
Amit

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

06 сентября 2017 г., 16:38:08

On 2017/09/04 10:10, Amit Langote wrote:
> On 2017/09/02 2:52, Robert Haas wrote:
>> It strikes me that this patch set is doing two things but maybe in the
>> opposite order that I would have chosen to attack them.  First,
>> there's getting partition pruning to use something other than
>> constraint exclusion.  Second, there's deferring work that is
>> currently done at an early stage of the process until later, so that
>> we waste less effort on partitions that are ultimately going to be
>> pruned.
> 
> OK.
> 
>> Therefore, IMHO, it would be best to focus first on how we're going to
>> identify the partitions that survive pruning, and then afterwards work
>> on transposing that logic to happen before partitions are opened and
>> locked.  That way, we get some incremental benefit sooner, and also
>> unblock some other development work.
> 
> Alright, I will try to do it that way.

Attached set of patches that does things that way.  Individual patches
described below:

[PATCH 1/5] Expand partitioned inheritance in a non-flattened manner

This will allow us to perform scan and join planning in a per partition
sub-tree manner, with each sub-tree's root getting its own RelOptInfo.
Previously, only the root of the whole partition tree would get a
RelOptInfo, along with the leaf partitions, with each leaf partition's
AppendRelInfo pointing to the former as its parent.

This is essential, because we will be doing partition-pruning for every
partitioned table in the tree by matching query's scan keys with its
partition key.  We won't be able to do that if the intermediate
partitioned tables didn't have a RelOptInfo.

[PATCH 2/5] WIP: planner-side changes for partition-pruning

This patch adds a stub get_partitions_for_keys in partition.c with a
suitable interface for the caller to pass bounding keys extracted from the
query and other related information.

Importantly, it implements the logic within the planner to match query's
scan keys to a parent table's partition key and form the bounding keys
that will be passed to partition.c to compute the list of partitions that
satisfy those bounds.

Also, it adds certain fields to RelOptInfos of the partitioned tables that
reflect its partitioning properties.

[PATCH 3/5] WIP: Interface changes for partition_bound_{cmp/bsearch}

This guy modifies the partition bound comparison function so that the
caller can pass incomplete partition key tuple that is potentially a
prefix of a multi-column partition key.  Currently, the input tuple must
contain all of key->partnatts values, but that may not be true for
queries, which may not have restrictions on all the partition key columns.

[PATCH 4/5] WIP: Implement get_partitions_for_keys()

This one fills the get_partitions_for_keys stub with the actual logic that
searches the partdesc->boundinfo for partition bounds that match the
bounding keys specified by the caller.

[PATCH 5/5] Add more tests for the new partitioning-related planning

More tests.

Some TODO items still remain to be done:

* Process OR clauses to use for partition-pruning
* Process redundant clauses (such as a = 10 and a > 1) more smartly
* Other tricks that are missing
* Fix bugs
* More tests

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On 2017/09/15 11:16, Amit Langote wrote:
> I will post rebased patches later today, although I think the overall
> design of the patch on the planner side of things is not quite there yet.
> Of course, your and others' feedback is greatly welcome.

Rebased patches attached.  Because Dilip complained earlier today about
clauses of the form (const op var) not causing partition-pruning, I've
added code to commute the clause where it is required.  Some other
previously mentioned limitations remain -- no handling of OR clauses, no
elimination of redundant clauses for given partitioning column, etc.

A note about 0001: this patch overlaps with
0003-Canonical-partition-scheme.patch from the partitionwise-join patch
series that Ashutosh Bapat posted yesterday [1].  Because I implemented
the planner-portion of this patch based on what 0001 builds, I'm posting
it here.  It might actually turn out that we will review and commit
0003-Canonical-partition-scheme.patch on that thread, but meanwhile apply
0001 if you want to play with the later patches.  I would certainly like
to review  0003-Canonical-partition-scheme.patch myself, but won't be able
to immediately (see below).

> Also, I must inform to all of those who're looking at this thread that I
> won't be able to respond to emails from tomorrow (9/16, Sat) until 9/23,
> Sat, due to some personal business.

To remind.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFiTN-skmaqeCVaoAHCBqe2DyfO3f6sgdtEjHWrUgi0kV1yPLQ%40mail.gmail.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

I have done some refactoring of the code where I have moved the code
of getting the matching clause into the separate function so that it
can fetch the matching clause from any set of given restriction list.

It can be applied on top of 0002-WIP:
planner-side-changes-for-partition-pruning.patch

On Sat, Sep 16, 2017 at 3:13 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Fri, Sep 15, 2017 at 2:20 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/09/15 11:16, Amit Langote wrote:
>
> Thanks for the updated patch.  I was going through the logic of
> get_rel_partitions in 0002 as almost similar functionality will be
> required by runtime partition pruning on which Beena is working.  The
> only difference is that here we are processing the
> "rel->baserestrictinfo" and in case of runtime pruning, we also need
> to process join clauses which are pushed down to appendrel.
>
> So can we make some generic logic which can be used for both the patches.
>
> So basically, we need to do two changes
>
> 1. In get_rel_partitions instead of processing the
> "rel->baserestrictinfo" we can take clause list as input that way we
> can pass any clause list to this function.
>
> 2. Don't call "get_partitions_for_keys" routine from the
> "get_rel_partitions", instead, get_rel_partitions can just prepare
> minkey, maxkey and the caller of the get_rel_partitions can call
> get_partitions_for_keys, because for runtime pruning we need to call
> get_partitions_for_keys at runtime.
>
> After these changes also there will be one problem that the
> get_partitions_for_keys is directly fetching the "rightop->constvalue"
> whereas, for runtime pruning, we need to store rightop itself and
> calculate the value at runtime by param evaluation,  I haven't yet
> thought how can we make this last part generic.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

0002-refactor_get_rel_partition.patch

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

25 сентября 2017 г., 16:04:36

Hi Dilip.

Thanks for looking at the patches and the comments.

On 2017/09/16 18:43, Dilip Kumar wrote:
> On Fri, Sep 15, 2017 at 2:20 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/09/15 11:16, Amit Langote wrote:
> 
> Thanks for the updated patch.  I was going through the logic of
> get_rel_partitions in 0002 as almost similar functionality will be
> required by runtime partition pruning on which Beena is working.  The
> only difference is that here we are processing the
> "rel->baserestrictinfo" and in case of runtime pruning, we also need
> to process join clauses which are pushed down to appendrel.

Yeah, I agree with the point you seem to be making that
get_rel_partitions() covers a lot of functionality, which it would be nice
to break down into reusable function(s) with suitable signature(s) that
the executor will also be able to use.

Your proposed refactoring patch down-thread seems to be a good step in
that direction.  Thanks for working on it.

> So can we make some generic logic which can be used for both the patches.
> 
> So basically, we need to do two changes
> 
> 1. In get_rel_partitions instead of processing the
> "rel->baserestrictinfo" we can take clause list as input that way we
> can pass any clause list to this function.
> 
> 2. Don't call "get_partitions_for_keys" routine from the
> "get_rel_partitions", instead, get_rel_partitions can just prepare
> minkey, maxkey and the caller of the get_rel_partitions can call
> get_partitions_for_keys, because for runtime pruning we need to call
> get_partitions_for_keys at runtime.

It's not clear to me whether get_rel_partitions() itself, as it is, is
callable from outside the planner, because its signature contains
RelOptInfo.  We have the RelOptInfo in the signature, because we want to
mark certain fields in it so that latter planning steps can use them.  So,
get_rel_partitions()'s job is not just to match clauses and find
partitions, but also to perform certain planner-specific tasks of
generating information that the later planning steps will want to use.
That may turn out to be unnecessary, but until we know that, let's not try
to export get_rel_partitions() itself out of the planner.

OTOH, the function that your refactoring patch separates out to match
clauses to partition keys and extract bounding values seems reusable
outside the planner and we should export it in such a way that it can be
used in the executor.  Then, the hypothetical executor function that does
the pruning will first call the planner's clause-matching function,
followed by calling get_partitions_for_keys() in partition.c to get the
selected partitions.

We should be careful when designing the interface of the exported function
to make sure it's not bound to the planner.  Your patch still maintains
the RelOptInfo in the signature of the clause-matching function, which the
executor pruning function won't have access to.

> After these changes also there will be one problem that the
> get_partitions_for_keys is directly fetching the "rightop->constvalue"
> whereas, for runtime pruning, we need to store rightop itself and
> calculate the value at runtime by param evaluation,  I haven't yet
> thought how can we make this last part generic.

I don't think any code introduced by the patch in partition.c itself looks
inside OpExpr (or any type of clause for that matter).  That is, I don't
see where get_partitions_for_keys() is looking at rightop->constvalue.
All it receives to work with are arrays of Datums and some other relevant
information like inclusivity, nullness, etc.

By the way, I'm now rebasing these patches on top of [1] and will try to
merge your refactoring patch in some appropriate way.  Will post more
tomorrow.

Thanks,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9140cf826

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Dilip Kumar

Дата:

25 сентября 2017 г., 17:21:52

On Mon, Sep 25, 2017 at 3:34 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

> Thanks for looking at the patches and the comments.

> It's not clear to me whether get_rel_partitions() itself, as it is, is
> callable from outside the planner, because its signature contains
> RelOptInfo.  We have the RelOptInfo in the signature, because we want to
> mark certain fields in it so that latter planning steps can use them.  So,
> get_rel_partitions()'s job is not just to match clauses and find
> partitions, but also to perform certain planner-specific tasks of
> generating information that the later planning steps will want to use.
> That may turn out to be unnecessary, but until we know that, let's not try
> to export get_rel_partitions() itself out of the planner.
>
> OTOH, the function that your refactoring patch separates out to match
> clauses to partition keys and extract bounding values seems reusable
> outside the planner and we should export it in such a way that it can be
> used in the executor.  Then, the hypothetical executor function that does
> the pruning will first call the planner's clause-matching function,
> followed by calling get_partitions_for_keys() in partition.c to get the
> selected partitions.

Thanks for your reply.

Actually, we are still planning to call get_matching_clause at the
optimizer time only.  Since we can not use get_rel_partitions function
directly for runtime pruning because it does all the work (find
matching clause, create minkey and maxkey and call
get_partitions_for_keys) during planning time itself.

For runtime pruning, we are planning to first get_matching_clause
function during optimizer time to identify the clause which is
matching with partition keys, but for PARAM_EXEC case we can not
depend upon baserelrestriction instead we will get the from join
clause, that's the reason I have separated out get_matching_clause.
But it will still be used during planning time.

After separating out the matching clause we will do somewhat similar
processing what "get_rel_partitions" is doing. But, at optimizer time
for PARAM we will not have Datum values for rightop, so we will keep
track of the PARAM itself.

And, finally at runtime when we get the PARAM value we can prepare
minkey and maxkey and call get_partitions_for_keys function.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

26 сентября 2017 г., 15:15:00

On 2017/09/25 20:21, Dilip Kumar wrote:
> On Mon, Sep 25, 2017 at 3:34 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> 
>> Thanks for looking at the patches and the comments.
> 
>> It's not clear to me whether get_rel_partitions() itself, as it is, is
>> callable from outside the planner, because its signature contains
>> RelOptInfo.  We have the RelOptInfo in the signature, because we want to
>> mark certain fields in it so that latter planning steps can use them.  So,
>> get_rel_partitions()'s job is not just to match clauses and find
>> partitions, but also to perform certain planner-specific tasks of
>> generating information that the later planning steps will want to use.
>> That may turn out to be unnecessary, but until we know that, let's not try
>> to export get_rel_partitions() itself out of the planner.
>>
>> OTOH, the function that your refactoring patch separates out to match
>> clauses to partition keys and extract bounding values seems reusable
>> outside the planner and we should export it in such a way that it can be
>> used in the executor.  Then, the hypothetical executor function that does
>> the pruning will first call the planner's clause-matching function,
>> followed by calling get_partitions_for_keys() in partition.c to get the
>> selected partitions.
> 
> Thanks for your reply.
> 
> Actually, we are still planning to call get_matching_clause at the
> optimizer time only.  Since we can not use get_rel_partitions function
> directly for runtime pruning because it does all the work (find
> matching clause, create minkey and maxkey and call
> get_partitions_for_keys) during planning time itself.
>
> For runtime pruning, we are planning to first get_matching_clause
> function during optimizer time to identify the clause which is
> matching with partition keys, but for PARAM_EXEC case we can not
> depend upon baserelrestriction instead we will get the from join
> clause, that's the reason I have separated out get_matching_clause.
> But it will still be used during planning time.

I see.  So, in the run-time pruning case, only the work of extracting
bounding values is deferred to execution time.  Matching clauses with the
partition key still occurs during planning time.  Only that the clauses
that require run-time pruning are not those in rel->baserestrictinfo.

> After separating out the matching clause we will do somewhat similar
> processing what "get_rel_partitions" is doing. But, at optimizer time
> for PARAM we will not have Datum values for rightop, so we will keep
> track of the PARAM itself.

I guess information about which PARAMs map to which partition keys will be
kept in the plan somehow.

> And, finally at runtime when we get the PARAM value we can prepare
> minkey and maxkey and call get_partitions_for_keys function.

Note that get_partitions_for_keys() is not planner code, nor is it bound
with any other planning code.  It's callable from executor without much
change.  Maybe you already know that though.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Dilip Kumar

Дата:

26 сентября 2017 г., 18:45:31

On Tue, Sep 26, 2017 at 2:45 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/09/25 20:21, Dilip Kumar wrote:

> I see.  So, in the run-time pruning case, only the work of extracting
> bounding values is deferred to execution time.  Matching clauses with the
> partition key still occurs during planning time.  Only that the clauses
> that require run-time pruning are not those in rel->baserestrictinfo.

Right.
>
>> After separating out the matching clause we will do somewhat similar
>> processing what "get_rel_partitions" is doing. But, at optimizer time
>> for PARAM we will not have Datum values for rightop, so we will keep
>> track of the PARAM itself.
>
> I guess information about which PARAMs map to which partition keys will be
> kept in the plan somehow.

Yes.
>
>> And, finally at runtime when we get the PARAM value we can prepare
>> minkey and maxkey and call get_partitions_for_keys function.
>
> Note that get_partitions_for_keys() is not planner code, nor is it bound
> with any other planning code.  It's callable from executor without much
> change.  Maybe you already know that though.

Yes, Right.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Jesper Pedersen

Дата:

26 сентября 2017 г., 19:00:47

Hi Amit,

On 09/15/2017 04:50 AM, Amit Langote wrote:
> On 2017/09/15 11:16, Amit Langote wrote:
>> I will post rebased patches later today, although I think the overall
>> design of the patch on the planner side of things is not quite there yet.
>> Of course, your and others' feedback is greatly welcome.
> 
> Rebased patches attached.  Because Dilip complained earlier today about
> clauses of the form (const op var) not causing partition-pruning, I've
> added code to commute the clause where it is required.  Some other
> previously mentioned limitations remain -- no handling of OR clauses, no
> elimination of redundant clauses for given partitioning column, etc.
> 
> A note about 0001: this patch overlaps with
> 0003-Canonical-partition-scheme.patch from the partitionwise-join patch
> series that Ashutosh Bapat posted yesterday [1].  Because I implemented
> the planner-portion of this patch based on what 0001 builds, I'm posting
> it here.  It might actually turn out that we will review and commit
> 0003-Canonical-partition-scheme.patch on that thread, but meanwhile apply
> 0001 if you want to play with the later patches.  I would certainly like
> to review  0003-Canonical-partition-scheme.patch myself, but won't be able
> to immediately (see below).
> 

Could you share your thoughts on the usage of PartitionAppendInfo's 
min_datum_idx / max_datum_idx ? Especially in relation to hash partitions.

I'm looking at get_partitions_for_keys.

Best regards, Jesper


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Robert Haas

Дата:

26 сентября 2017 г., 20:33:04

On Tue, Sep 26, 2017 at 9:00 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> Could you share your thoughts on the usage of PartitionAppendInfo's
> min_datum_idx / max_datum_idx ? Especially in relation to hash partitions.

This brings up something that I've kind of been thinking about.  There
are sort of four cases when it comes to partition pruning:

1. There is exactly one matching partition.  For example, this happens
when there is an equality constraint on every partition column.

2. There are multiple matching partitions which are consecutive.  For
example, there is a single level of range partitioning with no default
partition and the single partitioning column is constrained by < > <=
or >=.

3. There are multiple matching partitions which are not consecutive.
This case is probably rare, but it can happen if there is a default
partition, if there are list partitions with multiple bounds that are
interleaved (e.g. p1 allows (1, 4), p2 allows (2), p3 allows (3, 5),
and the query allows values >= 4 and <= 5), if the query involves OR
conditions, or if there are multiple levels of partitioning (e.g.
partition by a, subpartition by b, put a range constraint on a and an
equality constraint on b).

4. There are no matching partitions.

One of the goals of this algorithm is to be fast.  The obvious way to
cater to case (3) is to iterate through all partitions and test
whether each one works, returning a Bitmapset, but that is O(n).
Admittedly, it might be O(n) with a pretty small constant factor, but
it still seems like exactly the sort of thing that we want to avoid
given the desire to scale to higher partition counts.

I propose that we create a structure that looks like this:

struct foo {  int min_partition;  int max_partition;  Bitmapset *extra_partitions;
};

This indicates that all partitions from min_partition to max_partition
need to be scanned, and in addition any partitions in extra_partitions
need to be scanned.  Assuming that we only consider cases where all
partition keys or a leading subset of the partition keys are
constrained, we'll generally be able to get by with just setting
min_partition and max_partition, but extra_partitions can be used to
handle default partitions and interleaved list bounds.  For equality
on all partitioning columns, we can do a single bsearch of the bounds
to identify the target partition at a given partitioning level, and
the same thing works for a single range-bound.  If there are two
range-bounds (< and > or <= and >= or whatever) we need to bsearch
twice.  The default partition, if any and if matched, must also be
included.  When there are multiple levels of partitioning things get a
bit more complex -- if someone wants to knock out a partition that
breaks up the range, we might need to shrink the main range to cover
part of it and kick the other indexes out to extra_partitions.

But the good thing is that in common cases with only O(lg n) effort we
can return O(1) data that describes what will be scanned.  In cases
where that's not practical we expend more effort but still prune with
maximal effectiveness.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Jesper Pedersen

Дата:

26 сентября 2017 г., 20:57:15

On 09/26/2017 10:33 AM, Robert Haas wrote:
> On Tue, Sep 26, 2017 at 9:00 AM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> Could you share your thoughts on the usage of PartitionAppendInfo's
>> min_datum_idx / max_datum_idx ? Especially in relation to hash partitions.
> 
> This brings up something that I've kind of been thinking about.  There
> are sort of four cases when it comes to partition pruning:
> 
> 1. There is exactly one matching partition.  For example, this happens
> when there is an equality constraint on every partition column.
> 
> 2. There are multiple matching partitions which are consecutive.  For
> example, there is a single level of range partitioning with no default
> partition and the single partitioning column is constrained by < > <=
> or >=.
> 
> 3. There are multiple matching partitions which are not consecutive.
> This case is probably rare, but it can happen if there is a default
> partition, if there are list partitions with multiple bounds that are
> interleaved (e.g. p1 allows (1, 4), p2 allows (2), p3 allows (3, 5),
> and the query allows values >= 4 and <= 5), if the query involves OR
> conditions, or if there are multiple levels of partitioning (e.g.
> partition by a, subpartition by b, put a range constraint on a and an
> equality constraint on b).
> 
> 4. There are no matching partitions.
> 
> One of the goals of this algorithm is to be fast.  The obvious way to
> cater to case (3) is to iterate through all partitions and test
> whether each one works, returning a Bitmapset, but that is O(n).
> Admittedly, it might be O(n) with a pretty small constant factor, but
> it still seems like exactly the sort of thing that we want to avoid
> given the desire to scale to higher partition counts.
> 
> I propose that we create a structure that looks like this:
> 
> struct foo {
>     int min_partition;
>     int max_partition;
>     Bitmapset *extra_partitions;
> };
> 
> This indicates that all partitions from min_partition to max_partition
> need to be scanned, and in addition any partitions in extra_partitions
> need to be scanned.  Assuming that we only consider cases where all
> partition keys or a leading subset of the partition keys are
> constrained, we'll generally be able to get by with just setting
> min_partition and max_partition, but extra_partitions can be used to
> handle default partitions and interleaved list bounds.  For equality
> on all partitioning columns, we can do a single bsearch of the bounds
> to identify the target partition at a given partitioning level, and
> the same thing works for a single range-bound.  If there are two
> range-bounds (< and > or <= and >= or whatever) we need to bsearch
> twice.  The default partition, if any and if matched, must also be
> included.  When there are multiple levels of partitioning things get a
> bit more complex -- if someone wants to knock out a partition that
> breaks up the range, we might need to shrink the main range to cover
> part of it and kick the other indexes out to extra_partitions.
> 
> But the good thing is that in common cases with only O(lg n) effort we
> can return O(1) data that describes what will be scanned.  In cases
> where that's not practical we expend more effort but still prune with
> maximal effectiveness.
> 

For OLTP style applications 1) would be the common case, and with hash 
partitions it would be one equality constraint.

So, changing the method signature to use a data type as you described 
above instead of the explicit min_datum_idx / max_datum_idx output 
parameters would be more clear.

One could advocate (*cough*) that the hash partition patch [1] should be 
merged first in order to find other instances of where other CommitFest 
entries doesn't account for hash partitions at the moment in their 
method signatures; Beena noted something similar in [2]. I know that you 
said otherwise [3], but this is CommitFest 1, so there is time for a 
revert later, and hash partitions are already useful in internal testing.

[1] https://commitfest.postgresql.org/14/1059/
[2] 
https://www.postgresql.org/message-id/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A%40mail.gmail.com
[3] http://rhaas.blogspot.com/2017/08/plans-for-partitioning-in-v11.html

Best regards, Jesper


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Robert Haas

Дата:

26 сентября 2017 г., 22:51:52

On Tue, Sep 26, 2017 at 10:57 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> One could advocate (*cough*) that the hash partition patch [1] should be
> merged first in order to find other instances of where other CommitFest
> entries doesn't account for hash partitions at the moment in their method
> signatures; Beena noted something similar in [2]. I know that you said
> otherwise [3], but this is CommitFest 1, so there is time for a revert
> later, and hash partitions are already useful in internal testing.

Well, that's a fair point.  I was assuming that committing things in
that order would cause me to win the "least popular committer" award
at least for that day, but maybe not.  It's certainly not ideal to
have to juggle that patch along and keep rebasing it over other
changes when it's basically done, and just waiting on other
improvements to land.  Anybody else wish to express an opinion?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

27 сентября 2017 г., 03:04:10

On 25 September 2017 at 23:04, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> By the way, I'm now rebasing these patches on top of [1] and will try to
> merge your refactoring patch in some appropriate way.  Will post more
> tomorrow.
>
> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9140cf826

Yeah, I see 0001 conflicts with that. I'm going to set this to waiting
on author while you're busy rebasing this.

-- David Rowley                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

27 сентября 2017 г., 06:39:38

On 2017/09/27 1:51, Robert Haas wrote:
> On Tue, Sep 26, 2017 at 10:57 AM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> One could advocate (*cough*) that the hash partition patch [1] should be
>> merged first in order to find other instances of where other CommitFest
>> entries doesn't account for hash partitions at the moment in their method
>> signatures; Beena noted something similar in [2]. I know that you said
>> otherwise [3], but this is CommitFest 1, so there is time for a revert
>> later, and hash partitions are already useful in internal testing.
> 
> Well, that's a fair point.  I was assuming that committing things in
> that order would cause me to win the "least popular committer" award
> at least for that day, but maybe not.  It's certainly not ideal to
> have to juggle that patch along and keep rebasing it over other
> changes when it's basically done, and just waiting on other
> improvements to land.  Anybody else wish to express an opinion?

FWIW, I tend to agree that it would be nice to get the hash partitioning
patch in, even with old constraint exclusion based partition-pruning not
working for hash partitions.  That way, it might be more clear what we
need to do in the partition-pruning patches to account for hash partitions.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

27 сентября 2017 г., 07:22:40

Hi David,

On 2017/09/27 6:04, David Rowley wrote:
> On 25 September 2017 at 23:04, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> By the way, I'm now rebasing these patches on top of [1] and will try to
>> merge your refactoring patch in some appropriate way.  Will post more
>> tomorrow.
>>
>> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9140cf826
> 
> Yeah, I see 0001 conflicts with that. I'm going to set this to waiting
> on author while you're busy rebasing this.

Thanks for the reminder.  Just thought I'd say that while I'm actually
done rebasing itself (attaching rebased patches to try 'em out), I'm now
considering Robert's comments and will be busy for a bit revising things
based on those comments.

Some notes about the attached patches:

- 0001 includes refactoring that Dilip proposed upthread [1] (added him as
  an author).  I slightly tweaked his patch -- renamed the function
  get_matching_clause to match_clauses_to_partkey, similar to
  match_clauses_to_index.

- Code to set AppendPath's partitioned_rels in add_paths_to_append_rel()
  revised by 0a480502b09 (was originally introduced in d3cc37f1d80) is
  still revised to get partitioned_rels from a source that is not
  PlannerInfo.pcinfo_list.  With the new code, partitioned_rels won't
  contain RT indexes of the partitioned child tables that were pruned.


Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFiTN-tGnQzF_4QtbOHT-3hE%3DOvNaMfbbeRxa4UY0CQyF0G8gQ%40mail.gmail.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

27 сентября 2017 г., 11:03:20

Hi Jesper.

Firstly, thanks for looking at the patch.

On 2017/09/26 22:00, Jesper Pedersen wrote:
> Hi Amit,
> 
> On 09/15/2017 04:50 AM, Amit Langote wrote:
>> On 2017/09/15 11:16, Amit Langote wrote:
>>> I will post rebased patches later today, although I think the overall
>>> design of the patch on the planner side of things is not quite there yet.
>>> Of course, your and others' feedback is greatly welcome.
>>
>> Rebased patches attached.  Because Dilip complained earlier today about
>> clauses of the form (const op var) not causing partition-pruning, I've
>> added code to commute the clause where it is required.  Some other
>> previously mentioned limitations remain -- no handling of OR clauses, no
>> elimination of redundant clauses for given partitioning column, etc.
>>
>> A note about 0001: this patch overlaps with
>> 0003-Canonical-partition-scheme.patch from the partitionwise-join patch
>> series that Ashutosh Bapat posted yesterday [1].  Because I implemented
>> the planner-portion of this patch based on what 0001 builds, I'm posting
>> it here.  It might actually turn out that we will review and commit
>> 0003-Canonical-partition-scheme.patch on that thread, but meanwhile apply
>> 0001 if you want to play with the later patches.  I would certainly like
>> to review  0003-Canonical-partition-scheme.patch myself, but won't be able
>> to immediately (see below).
>>
> 
> Could you share your thoughts on the usage of PartitionAppendInfo's
> min_datum_idx / max_datum_idx ? Especially in relation to hash partitions.
> 
> I'm looking at get_partitions_for_keys.

Sure.  You may have noticed that min_datum_idx and max_datum_idx in
PartitionAppendInfo are offsets in the PartitionDescData.boundinfo.datums
array, of datums that lie within the query-specified range (that is, using
=, >, >=, <, <= btree operators in the query). That array contains bounds
of all partitions sorted in the btree operator class defined order, at
least for list and range partitioning.  I haven't (yet) closely looked at
the composition of hash partition datums in PartitionBoundInfo, which
perhaps have different ordering properties (or maybe none) than list and
range partitioning datums.

Now, since they are offsets of datums in PartitionBoundInfo, not indexes
of partitions themselves, their utility outside partition.c might be
questionable.  But partition-wise join patch, for example, to determine if
two partitioned tables can be joined partition-wise, is going to check if
PartitionBoundInfos in RelOptInfos of two partitioned tables are
bound-to-bound equal [1].  Partition-pruning may select only a subset of
partitions of each of the joining partitioned tables.  Equi-join
requirement for partition-wise join means that the subset of partitions
will be same on both sides of the join.  My intent of having
min_datum_idx, max_datum_idx, along with contains_null_partition, and
contains_default_partition in PartitionAppendInfo is to have sort of a
cross check that those values end up being same on both sides of the join
after equi-join requirement has been satisfied.  That is,
get_partitions_for_keys will have chosen the same set of partitions for
both partitioned tables and hence will have set the same values for those
fields.

As mentioned above, that may be enough for list and range partitioning,
but since hash partition datums do not appear to have the same properties
as list and range datums [2], we may need an additional field(s) to
describe the hash partition selected by get_partitions_for_keys.  I guess
only one field will be enough, that will be the offset in the datums array
of the hash partition chosen for the query or -1 if query quals couldn't
conclusively determine one (maybe not all partition keys were specified to
be hashed or some or all used non-equality operator).

Hope that answers your question at least to some degree.  Now, there are
some points Robert mentioned in his reply that I will need to also
consider in the patch, which I'll go do now. :)

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFjFpRc4UdCYknBai9pBu2GA1h4nZVNPDmzgs4jOkqFamT1huA%40mail.gmail.com

[2] It appears that in the hash partitioning case, unlike list and range   partitioning, PartitionBoundInfo doesn't
containvalues that are   directly comparable to query-specified constants, but a pair (modulus,   remainder) for each
partition. We'll first hash *all* the key values   (mentioned in the query) using the partitioning hash machinery and
thendetermine the hash partition index by using   (hash % largest_modulus) as offset into the
PartitionBoundInfo.indexes  array.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

amul sul

Дата:

27 сентября 2017 г., 12:20:06

On Wed, Sep 27, 2017 at 6:09 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/09/27 1:51, Robert Haas wrote:
> On Tue, Sep 26, 2017 at 10:57 AM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> One could advocate (*cough*) that the hash partition patch [1] should be
>> merged first in order to find other instances of where other CommitFest
>> entries doesn't account for hash partitions at the moment in their method
>> signatures; Beena noted something similar in [2]. I know that you said
>> otherwise [3], but this is CommitFest 1, so there is time for a revert
>> later, and hash partitions are already useful in internal testing.
>
> Well, that's a fair point. I was assuming that committing things in
> that order would cause me to win the "least popular committer" award
> at least for that day, but maybe not. It's certainly not ideal to
> have to juggle that patch along and keep rebasing it over other
> changes when it's basically done, and just waiting on other
> improvements to land. Anybody else wish to express an opinion?

FWIW, I tend to agree that it would be nice to get the hash partitioning
patch in, even with old constraint exclusion based partition-pruning not
working for hash partitions. That way, it might be more clear what we
need to do in the partition-pruning patches to account for hash partitions.

regards,

Amul

Re: [HACKERS] path toward faster partition pruning

От

Dilip Kumar

Дата:

28 сентября 2017 г., 10:58:33

On Wed, Sep 27, 2017 at 6:52 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

I was looking into the latest patch set, seems like we can reuse some
more code between this path and runtime pruning[1]

+ foreach(lc1, matchedclauses[i])
+ {
+ Expr   *clause = lfirst(lc1);
+ Const  *rightop = (Const *) get_rightop(clause);
+ Oid opno = ((OpExpr *) clause)->opno,
+ opfamily = rel->part_scheme->partopfamily[i];
+ StrategyNumber strategy;
+
+ strategy = get_op_opfamily_strategy(opno, opfamily);
+ switch (strategy)
+ {
+ case BTLessStrategyNumber:
+ case BTLessEqualStrategyNumber:
+ if (need_next_max)
+ {
+ maxkeys[i] = rightop->constvalue;
+ if (!maxkey_set[i])
+ n_maxkeys++;
+ maxkey_set[i] = true;
+ max_incl = (strategy == BTLessEqualStrategyNumber);
+ }
+ if (strategy == BTLessStrategyNumber)
+ need_next_max = false;

I think the above logic is common between this patch and the runtime
pruning.  I think we can make
a reusable function.  Here we are preparing minkey and maxkey of Datum
because we can directly fetch rightop->constvalue whereas for runtime
pruning we are making minkeys and maxkeys of Expr because during
planning time we don't have the values for the Param.  I think we can
always make these minkey, maxkey array of Expr and later those can be
processed in whatever way we want it.  So this path will fetch the
constval out of Expr and runtime pruning will Eval that expression at
runtime.

Does this make sense or it will cause one level of extra processing
for this path i.e converting the Expr array to CONST array?

[1] https://www.postgresql.org/message-id/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

28 сентября 2017 г., 14:14:56

On 2017/09/28 13:58, Dilip Kumar wrote:
> On Wed, Sep 27, 2017 at 6:52 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> 
> I was looking into the latest patch set, seems like we can reuse some
> more code between this path and runtime pruning[1]
> 
> + foreach(lc1, matchedclauses[i])
> + {
> + Expr   *clause = lfirst(lc1);
> + Const  *rightop = (Const *) get_rightop(clause);
> + Oid opno = ((OpExpr *) clause)->opno,
> + opfamily = rel->part_scheme->partopfamily[i];
> + StrategyNumber strategy;
> +
> + strategy = get_op_opfamily_strategy(opno, opfamily);
> + switch (strategy)
> + {
> + case BTLessStrategyNumber:
> + case BTLessEqualStrategyNumber:
> + if (need_next_max)
> + {
> + maxkeys[i] = rightop->constvalue;
> + if (!maxkey_set[i])
> + n_maxkeys++;
> + maxkey_set[i] = true;
> + max_incl = (strategy == BTLessEqualStrategyNumber);
> + }
> + if (strategy == BTLessStrategyNumber)
> + need_next_max = false;
> 
> I think the above logic is common between this patch and the runtime
> pruning.  I think we can make
> a reusable function.  Here we are preparing minkey and maxkey of Datum
> because we can directly fetch rightop->constvalue whereas for runtime
> pruning we are making minkeys and maxkeys of Expr because during
> planning time we don't have the values for the Param.  I think we can
> always make these minkey, maxkey array of Expr and later those can be
> processed in whatever way we want it.  So this path will fetch the
> constval out of Expr and runtime pruning will Eval that expression at
> runtime.

I think that makes sense.  In fact we could even move the minkey/maxkey
collection code to match_clauses_to_partkey() itself.  No need for a
different function and worrying about defining a separate interface for
the same.  We match clauses exactly because we want to extract the
constant or param values out of them.  No need to do the two activities
independently and in different places.

> Does this make sense or it will cause one level of extra processing
> for this path i.e converting the Expr array to CONST array?

Hm, it's not such a big cost to pay I'd think.

I will update the planner patch accordingly.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Dilip Kumar

Дата:

28 сентября 2017 г., 14:16:56

On Thu, Sep 28, 2017 at 1:44 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/09/28 13:58, Dilip Kumar wrote:

>> I think the above logic is common between this patch and the runtime
>> pruning.  I think we can make
>> a reusable function.  Here we are preparing minkey and maxkey of Datum
>> because we can directly fetch rightop->constvalue whereas for runtime
>> pruning we are making minkeys and maxkeys of Expr because during
>> planning time we don't have the values for the Param.  I think we can
>> always make these minkey, maxkey array of Expr and later those can be
>> processed in whatever way we want it.  So this path will fetch the
>> constval out of Expr and runtime pruning will Eval that expression at
>> runtime.
>
> I think that makes sense.  In fact we could even move the minkey/maxkey
> collection code to match_clauses_to_partkey() itself.  No need for a
> different function and worrying about defining a separate interface for
> the same.  We match clauses exactly because we want to extract the
> constant or param values out of them.  No need to do the two activities
> independently and in different places.
>

+1


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

28 сентября 2017 г., 15:16:20

On 27 September 2017 at 14:22, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> - 0001 includes refactoring that Dilip proposed upthread [1] (added him as
>   an author).  I slightly tweaked his patch -- renamed the function
>   get_matching_clause to match_clauses_to_partkey, similar to
>   match_clauses_to_index.

Hi Amit,

I've made a pass over the 0001 patch while trying to get myself up to
speed with the various developments that are going on in partitioning
right now.

These are just my thoughts from reading over the patch. I understand
that there's quite a bit up in the air right now about how all this is
going to work, but I have some thoughts about that too, which I
wouldn't mind some feedback on as my line of thought may be off.

Anyway, I will start with some small typos that I noticed while reading:

partition.c:

+ * telling what kind of NullTest has been applies to the corresponding

"applies" should be "applied"

plancat.c:

* might've occurred to satisfy the query.  Rest of the fields are set in

"Rest of the" should be "The remaining"

Any onto more serious stuff:

allpath.c:

get_rel_partitions()

I wonder if this function does not belong in partition.c. I'd have
expected a function to exist per partition type, RANGE and LIST, I'd
imagine should have their own function in partition.c to eliminate
partitions
which cannot possibly match, and return the remainder. I see people
speaking of HASH partitioning, but we might one day also want
something like RANDOM or ROUNDROBIN (I'm not really kidding, imagine
routing records to be processed into foreign tables where you never
need to query them again). It would be good if we could easily expand
this list and not have to touch files all over the optimizer to do
that. Of course, there would be other work to do in the executor to
implement any new partitioning method too.

I know there's mention of it somewhere about get_rel_partitions() not
being so smart about WHERE partkey > 20 AND partkey > 10, the code
does not understand what's more restrictive. I think you could
probably follow the same rules here that are done in
eval_const_expressions(). Over there I see that evaluate_function()
will call anything that's not marked as volatile. I'd imagine, for
each partition key, you'd want to store a Datum with the minimum and
maximum possible value based on the quals processed. If either the
minimum or maximum is still set to NULL, then it's unbounded at that
end. If you encounter partkey = Const, then minimum and maximum can be
set to the value of that Const. From there you can likely ignore any
other quals for that partition key, as if the query did contain
another qual with partkey = SomeOtherConst, then that would have
become a gating qual during the constant folding process. This way if
the user had written WHERE partkey >= 1 AND partkey <= 1 the
evaluation would end up the same as if they'd written WHERE partkey =
1 as the minimum and maximum would be the same value in both cases,
and when those two values are the same then it would mean just one
theoretical binary search on a partition range to find the correct
partition instead of two.

I see in get_partitions_for_keys you've crafted the function to return
something to identify which partitions need to be scanned. I think it
would be nice to see a special element in the partition array for the
NULL and DEFAULT partition. I imagine 0 and 1, but obviously, these
would be defined constants somewhere. The signature of that function
is not so pretty and that would likely tidy it up a bit. The matching
partition indexes could be returned as a Bitmapset, yet, I don't
really see any code which handles adding the NULL and DEFAULT
partition in get_rel_partitions() either, maybe I've just not looked
hard enough yet...

-- David Rowley                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Robert Haas

Дата:

29 сентября 2017 г., 22:28:20

On Thu, Sep 28, 2017 at 5:16 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> I'd imagine, for
> each partition key, you'd want to store a Datum with the minimum and
> maximum possible value based on the quals processed. If either the
> minimum or maximum is still set to NULL, then it's unbounded at that
> end. If you encounter partkey = Const, then minimum and maximum can be
> set to the value of that Const. From there you can likely ignore any
> other quals for that partition key, as if the query did contain
> another qual with partkey = SomeOtherConst, then that would have
> become a gating qual during the constant folding process. This way if
> the user had written WHERE partkey >= 1 AND partkey <= 1 the
> evaluation would end up the same as if they'd written WHERE partkey =
> 1 as the minimum and maximum would be the same value in both cases,
> and when those two values are the same then it would mean just one
> theoretical binary search on a partition range to find the correct
> partition instead of two.

I have not looked at the code submitted here in detail yet but I do
think we should try to avoid wasting cycles in the
presumably-quite-common case where equality is being tested.  The
whole idea of thinking of this as minimum/maximum seems like it might
be off precisely for that reason.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

02 октября 2017 г., 07:13:17

On 2017/09/30 1:28, Robert Haas wrote:
> On Thu, Sep 28, 2017 at 5:16 AM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> I'd imagine, for
>> each partition key, you'd want to store a Datum with the minimum and
>> maximum possible value based on the quals processed. If either the
>> minimum or maximum is still set to NULL, then it's unbounded at that
>> end. If you encounter partkey = Const, then minimum and maximum can be
>> set to the value of that Const. From there you can likely ignore any
>> other quals for that partition key, as if the query did contain
>> another qual with partkey = SomeOtherConst, then that would have
>> become a gating qual during the constant folding process. This way if
>> the user had written WHERE partkey >= 1 AND partkey <= 1 the
>> evaluation would end up the same as if they'd written WHERE partkey =
>> 1 as the minimum and maximum would be the same value in both cases,
>> and when those two values are the same then it would mean just one
>> theoretical binary search on a partition range to find the correct
>> partition instead of two.
> 
> I have not looked at the code submitted here in detail yet but I do
> think we should try to avoid wasting cycles in the
> presumably-quite-common case where equality is being tested.  The
> whole idea of thinking of this as minimum/maximum seems like it might
> be off precisely for that reason.

I agree.  Equality checks are going to be common enough to warrant them to
be handled specially, instead of implementing equality-pruning on top of
min/max framework.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Robert Haas

Дата:

02 октября 2017 г., 17:58:16

On Sun, Oct 1, 2017 at 9:13 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I agree.  Equality checks are going to be common enough to warrant them to
> be handled specially, instead of implementing equality-pruning on top of
> min/max framework.

What you might do is pass <btree-strategy-number, bounds> and
optionally allow a second <btree-strategy-number, bounds>.  Then for
the common case of equality you can pass BTEqualStrategyNumber and for
a range bounded at both ends you can pass BTGreaterStrategyNumber or
BTGreaterEqualStrategyNumber for one bound and BTLessStrategyNumber or
BTLessEqualStrategyNumber for the other.

Not sure if this is exactly the right idea but it's what pops to mind.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

03 октября 2017 г., 10:12:40

Hi David.

Thanks a lot for your review comments and sorry it took me a while to reply.

On 2017/09/28 18:16, David Rowley wrote:
> On 27 September 2017 at 14:22, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> - 0001 includes refactoring that Dilip proposed upthread [1] (added him as
>>   an author).  I slightly tweaked his patch -- renamed the function
>>   get_matching_clause to match_clauses_to_partkey, similar to
>>   match_clauses_to_index.
> 
> Hi Amit,
> 
> I've made a pass over the 0001 patch while trying to get myself up to
> speed with the various developments that are going on in partitioning
> right now.
> 
> These are just my thoughts from reading over the patch. I understand
> that there's quite a bit up in the air right now about how all this is
> going to work, but I have some thoughts about that too, which I
> wouldn't mind some feedback on as my line of thought may be off.
> 
> Anyway, I will start with some small typos that I noticed while reading:
> 
> partition.c:
> 
> + * telling what kind of NullTest has been applies to the corresponding
> 
> "applies" should be "applied"

Will fix.

> plancat.c:
> 
> * might've occurred to satisfy the query.  Rest of the fields are set in
> 
> "Rest of the" should be "The remaining"

Will fix.

> Any onto more serious stuff:
> 
> allpath.c:
> 
> get_rel_partitions()
> 
> I wonder if this function does not belong in partition.c. I'd have
> expected a function to exist per partition type, RANGE and LIST, I'd
> imagine should have their own function in partition.c to eliminate
> partitions
> which cannot possibly match, and return the remainder. I see people
> speaking of HASH partitioning, but we might one day also want
> something like RANDOM or ROUNDROBIN (I'm not really kidding, imagine
> routing records to be processed into foreign tables where you never
> need to query them again). It would be good if we could easily expand
> this list and not have to touch files all over the optimizer to do
> that. Of course, there would be other work to do in the executor to
> implement any new partitioning method too.

I think there will have to be at least some code in the optimizer.  That
is, the code to match the query to the table's partition keys and using
the constant values therein to then look up the table's partitions.  More
importantly, once the partitions (viz. their offsets in the table's
partition descriptor) have been looked up by partition.c, we must be able
to then map them to their planner data structures viz. their
AppendRelInfo, and subsequently RelOptInfo.  This last part will have to
be in the optimizer.  In fact, that was the role of get_rel_partitions in
the initial versions of the patch, when neither of the code for matching
keys and for pruning using constants was implemented.

One may argue that the first part, that is, matching clauses to match to
the partition key and subsequent lookup of partitions using constants
could both be implemented in partition.c, but it seems better to me to
keep at least the part of matching clauses to the partition keys within
the planner (just like matching clauses to indexes is done by the
planner).  Looking up partitions using constants cannot be done outside
partition.c anyway, so that's something we have to implement there.

> I know there's mention of it somewhere about get_rel_partitions() not
> being so smart about WHERE partkey > 20 AND partkey > 10, the code
> does not understand what's more restrictive. I think you could
> probably follow the same rules here that are done in
> eval_const_expressions(). Over there I see that evaluate_function()
> will call anything that's not marked as volatile.

Hmm, unless I've missed it, I don't see in evaluate_function() anything
about determining which clause is more restrictive.  AFAIK, such
determination depends on the btree operator class semantics (at least in
the most common cases where, say, ">" means greater than in a sense that
btree code uses it), so I was planning to handle it the way btree code
handles it in _bt_preprocess_keys().  In fact, the existing code in
predtest.c, which makes decisions of the similar vein also relies on btree
semantics.  It's OK to do so, because partitioning methods that exist
today and for which we'd like to have such smarts use btree semantics to
partition the data.  Also, we won't need to optimize such cases for hash
partitioning anyway.

> I'd imagine, for
> each partition key, you'd want to store a Datum with the minimum and
> maximum possible value based on the quals processed. If either the
> minimum or maximum is still set to NULL, then it's unbounded at that
> end. If you encounter partkey = Const, then minimum and maximum can be
> set to the value of that Const. From there you can likely ignore any
> other quals for that partition key, as if the query did contain
> another qual with partkey = SomeOtherConst, then that would have
> become a gating qual during the constant folding process. This way if
> the user had written WHERE partkey >= 1 AND partkey <= 1 the
> evaluation would end up the same as if they'd written WHERE partkey =
> 1 as the minimum and maximum would be the same value in both cases,
> and when those two values are the same then it would mean just one
> theoretical binary search on a partition range to find the correct
> partition instead of two.

Given the way the patch recognizes a given qual as contributing to the
equal key or minimum key or maximum key, it will not conclude the above to
in fact be an equal key, because that presumably would require comparing
clauses among each other to make such a discovery.  I'm slightly hesitant
to add code to do that in the first version of the patch.  That is, for
time being let WHERE partkey >= 1 and partkey <= 1 be handled by passing 1
as both minimum and maximum key, which requires two binary searches.
Whereas, WHERE partkey = 1 would require only one.  Planner code to get
rid of the extra binary search lookup could come later, IMHO.

> I see in get_partitions_for_keys you've crafted the function to return
> something to identify which partitions need to be scanned. I think it
> would be nice to see a special element in the partition array for the
> NULL and DEFAULT partition. I imagine 0 and 1, but obviously, these
> would be defined constants somewhere. The signature of that function
> is not so pretty and that would likely tidy it up a bit. The matching
> partition indexes could be returned as a Bitmapset, yet, I don't
> really see any code which handles adding the NULL and DEFAULT
> partition in get_rel_partitions() either, maybe I've just not looked
> hard enough yet...

New version of the patch I will post soon cleans up the interface of
get_partitions_for_keys quite a bit; particularly the way selected
partitions are returned, for which I adopted an idea that Robert Haas
mentioned [2].  When it recognizes that a sequence of consecutive
partitions are to be scanned, it will return the starting and ending
offsets as *min_part_idx and *max_part_idx.  Those that don't fit this
pattern (of which there should be only a few in many cases) are returned
in a Bitmapset, as a supposedly unordered set of partitioned.  Since NULL
and DEFAULT partitions are partitions of this later category, they would
be included the bitmapset if it turns out that the query will need to scan
them after all.

Thanks again.

Regards,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9140cf8269

[2]
https://www.postgresql.org/message-id/CA%2BTgmoYcv_MghvhV8pL33D95G8KVLdZOxFGX5dNASVkXO8QuPw%40mail.gmail.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

19 октября 2017 г., 12:46:49

On 2017/09/27 10:22, Amit Langote wrote:
> Thanks for the reminder.  Just thought I'd say that while I'm actually
> done rebasing itself (attaching rebased patches to try 'em out), I'm now
> considering Robert's comments and will be busy for a bit revising things
> based on those comments.

And here is the updated, significantly re-designed patch set.  I'm
dropping the WIP- from the patches' names and marking this patch set as
the v1 set (Jesper Pedersen pointed out to me offline that the patches
didn't have the vNumber before).

Significant points of revision:

* After thinking a bit more about the applicability/re-usability of the
code being discussed here in the run-time pruning case, I came to the
conclusion that my previous approach was a bit wrongheaded (as perhaps
also what David Rowley was thinking when he commented on some aspects of
the old patch's design at [1], so thanks to him for prodding me in what I
ended up thinking to be a good direction after all).

With the previous approach, a bit too much work would be done by the
planner with no possibility of the code doing that work being useful in
the executor (interface involved passing RelOptInfo *).  So, if some
optimization trick that would lead to better pruning decision depended on
the constant values in all the clauses being available, we'd have to skip
that optimization for clauses that would otherwise be chosen as run-time
pruning clauses, because by definition, they would not have constant
values available.  In the new design, planner code limits itself to only
matching the clauses to partition key (checking things like whether the
operator of a clause matched to a partition column is compatible with
partitioning, etc.) and adding every matched clause to a list partclauses.
 That should work unchanged for both the plan-time pruning case and the
run-time pruning case.  We don't look at the supposedly-constant operands
of clauses in the aforementioned planner code at all.

Now if the clauses in partclauses are all known to contain the constant
operand values (the plan-time pruning case), it can immediately pass them
to partition.c to analyze those clauses, extract bounding keys in a form
suitable to do lookup in PartitionBoundInfo and prune partitions that
won't satisfy those bounding keys (and hence the clauses).

If partclauses contains clauses that don't have the constant operand (the
run-time pruning case), don't go to partition.c just yet, instead stuff
the list into the plan (Append) node and go to partition.c only when all
the constant values are available (the patch at [2] will implement that).

* Unlike the previous approach where partition.c would return a list of
integer indexes, where the individual indexes would be those of the bound
datums and not those of partitions themselves, in the new design, indexes
of the selected partitions (their offsets in the PartitionDesc.oids array)
are returned in a way that Robert suggested [3] -- as a range of
contiguous partitions whenever possible (in the form of *min_part_idx and
*max_part_idx) and/or as a set of partitions appearing in no particular
order (in the form of a Bitmapset).  Second form is required not only for
default/null-only/list partitions (which do not have any notion of
ordering among each other), but also when individual arms of an OR clause
select partitions scattered all over the place.

* In add_paths_to_append_rel(), the partitioned_rels list passed to copy
into the Append path is no longer same as the one found in
PlannerInfo.pcinfo_list, because the latter contains *all* partitioned
child relations in a partition tree.  Instead, the patch teaches it to
only include *live* partitioned child relations.

* Since the partitionwise join code looks at *all* entries in
RelOptInfo.part_rels of both the joining partitioned relations, there is
some new code there to make dead partitions' RelOptInfos look valid with a
dummy path *after-the-fact*.  That's because, set_append_rel_size() whose
job it is to initialize certain fields of child RelOptInfos will do it
only for *live* partitions with the new arrangement, where only live
partitions of a partitioned table are processed by the main loop of
set_append_rel_size().

Some notes about the regression tests:

Patch 0001 adds new tests for partition-pruning.  Constraint exclusion
using internal partition constraints is not disabled in the code until the
last patch, which implements the last piece needed for the new partition
pruning to do any real work.  With that patch, we see some differences in
the plan generated using the new partition-pruning code which appear in
the patch as the diffs to expected/inherit.out and expected/partition.out
(latter is the new output file added by 0001).  I've almost convinced
myself that those diffs are simply artifacts of the difference in
implementation between constraint exclusion and the new partition-pruning
code and do not change the output that the plans produce.  The difference
stems from that either the old or the new method, in some cases, fails to
prune away a partition that should have been.  OTOH, in neither case do we
prune away a partition that shouldn't have been. :)

Description of the attached patches:

0001: add new tests for partition-pruning

0002: patch that makes all the changes needed in the planer (adds a stub
      function in partition.c)

0003: patch that implements the aforementioned stub (significant amount of
      code to analyze partition clauses and gin up bounding keys to
      compare with the values in PartitionBoundInfo; the actual function
      that will do the comparison is just a stub as of this patch)

0004: make some preparatory changes to partition_bound_cmp/bsearch, to be
      able to pass incomplete partition keys (aka, prefix of a multi-
      column key) for comparison with the values in PartitionBoundInfo
     (just a refactoring patch)

0005: implements the stub mentioned in 0003 and finally gets the new
      partition-pruning working (also disables constraint exclusion using
      internal partition constraints by teaching get_relation_constraints
      to not include those).

Feedback greatly welcome.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAKJS1f8Jzix8cs7tCDS_qNPd0CetHjB8x9fmLG4OTbCfthgo1w%40mail.gmail.com

[2]
https://www.postgresql.org/message-id/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A%40mail.gmail.com

[3]
https://www.postgresql.org/message-id/CA%2BTgmoYcv_MghvhV8pL33D95G8KVLdZOxFGX5dNASVkXO8QuPw%40mail.gmail.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] path toward faster partition pruning

От

Rajkumar Raghuwanshi

Дата:

23 октября 2017 г., 12:58:44

On Thu, Oct 19, 2017 at 12:16 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Description of the attached patches:

0001: add new tests for partition-pruning

0002: patch that makes all the changes needed in the planer (adds a stub
function in partition.c)

0003: patch that implements the aforementioned stub (significant amount of
code to analyze partition clauses and gin up bounding keys to
compare with the values in PartitionBoundInfo; the actual function
that will do the comparison is just a stub as of this patch)

0004: make some preparatory changes to partition_bound_cmp/bsearch, to be
able to pass incomplete partition keys (aka, prefix of a multi-
column key) for comparison with the values in PartitionBoundInfo
(just a refactoring patch)

0005: implements the stub mentioned in 0003 and finally gets the new
partition-pruning working (also disables constraint exclusion using
internal partition constraints by teaching get_relation_constraints
to not include those).

Feedback greatly welcome.

Hi Amit,

I have tried to apply attached patch. patch applied cleanly on commit id - bf54c0f05c0a58db17627724a83e1b6d4ec2712c

but make failed with below error.

./../../../src/include/nodes/relation.h:2126: error: redefinition of typedef ‘AppendRelInfo’
../../../../src/include/nodes/relation.h:584: note: previous declaration of ‘AppendRelInfo’ was here
make[4]: *** [gistbuild.o] Error 1

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

23 октября 2017 г., 13:42:44

Thanks Rajkumar.

On 2017/10/23 15:58, Rajkumar Raghuwanshi wrote:
> I have tried to apply attached patch. patch applied cleanly on commit id -
> bf54c0f05c0a58db17627724a83e1b6d4ec2712c
> but make failed with below error.
> 
> ./../../../src/include/nodes/relation.h:2126: error: redefinition of
> typedef ‘AppendRelInfo’
> ../../../../src/include/nodes/relation.h:584: note: previous declaration of
> ‘AppendRelInfo’ was here
> make[4]: *** [gistbuild.o] Error 1

The compiler I have here (gcc (GCC) 6.2.0) didn't complain like that for
this typedef redefinition introduced by the 0002 patch, but it seems that
it's not needed anyway, so got rid of that line in the attached updated patch.

Fixed one more useless diff in 0002, but no changes in any other patch.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] path toward faster partition pruning

От

Rajkumar Raghuwanshi

Дата:

23 октября 2017 г., 15:54:29

On Mon, Oct 23, 2017 at 1:12 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

The compiler I have here (gcc (GCC) 6.2.0) didn't complain like that for
this typedef redefinition introduced by the 0002 patch, but it seems that
it's not needed anyway, so got rid of that line in the attached updated patch.

Fixed one more useless diff in 0002, but no changes in any other patch

Thanks for updated patches, I am able to compile it on head.

While testing this, I got an observation, pruning is not scanning default partition leading to wrong output. below is test to reproduce this.

create table rp (a int, b varchar) partition by range (a);
create table rp_p1 partition of rp default;
create table rp_p2 partition of rp for values from (1) to (10);
create table rp_p3 partition of rp for values from (10) to (maxvalue);

insert into rp values (-1,'p1');
insert into rp values (1,'p2');
insert into rp values (11,'p3');

postgres=# select tableoid::regclass,* from rp;
tableoid | a | b
----------+----+----
rp_p2    | 1 | p2
rp_p3    | 11 | p3
rp_p1    | -1 | p1
(3 rows)

--with pruning
postgres=# explain (costs off) select * from rp where a <= 1;
        QUERY PLAN
--------------------------
Append
   -> Seq Scan on rp_p2
         Filter: (a <= 1)
(3 rows)

postgres=# select * from rp where a <= 1;
a | b
---+----
1 | p2
(1 row)

Thanks & Regards,

Rajkumar Raghuwanshi

QMG, EnterpriseDB Corporation

Re: [HACKERS] path toward faster partition pruning

От

Beena Emerson

Дата:

23 октября 2017 г., 22:38:59

On Mon, Oct 23, 2017 at 3:24 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
>
> On Mon, Oct 23, 2017 at 1:12 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>> The compiler I have here (gcc (GCC) 6.2.0) didn't complain like that for
>> this typedef redefinition introduced by the 0002 patch, but it seems that
>> it's not needed anyway, so got rid of that line in the attached updated
>> patch.
>>
>> Fixed one more useless diff in 0002, but no changes in any other patch
>
>
> Thanks for updated patches, I am able to compile it on head.
>
> While testing this, I got an observation, pruning is not scanning default
> partition leading to wrong output. below is test to reproduce this.
>
> create table rp (a int, b varchar) partition by range (a);
> create table rp_p1 partition of rp default;
> create table rp_p2 partition of rp for values from (1) to (10);
> create table rp_p3 partition of rp for values from (10) to (maxvalue);
>
> insert into rp values (-1,'p1');
> insert into rp values (1,'p2');
> insert into rp values (11,'p3');
>
> postgres=# select tableoid::regclass,* from rp;
>  tableoid | a  | b
> ----------+----+----
>  rp_p2    |  1 | p2
>  rp_p3    | 11 | p3
>  rp_p1    | -1 | p1
> (3 rows)
>
> --with pruning
> postgres=# explain (costs off) select * from rp where a <= 1;
>         QUERY PLAN
> --------------------------
>  Append
>    ->  Seq Scan on rp_p2
>          Filter: (a <= 1)
> (3 rows)
>
> postgres=# select * from rp where a <= 1;
>  a | b
> ---+----
>  1 | p2
> (1 row)
>

I had noticed this and also that this crash:

tprt PARTITION BY RANGE(Col1)      tprt_1 FOR VALUES FROM (1) TO (50001) PARTITION BY RANGE(Col1)             tprt_11
FORVALUES FROM (1) TO (10000),             tprt_1d DEFAULT      tprt_2 FOR VALUES FROM (50001) TO (100001)
 

EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 BETWEEN 20000 AND 70000;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>



-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

25 октября 2017 г., 12:47:46

Thanks a lot Rajkumar and Beena for the tests.

On 2017/10/24 1:38, Beena Emerson wrote:
> On Mon, Oct 23, 2017 at 3:24 PM, Rajkumar Raghuwanshi wrote:
>> Thanks for updated patches, I am able to compile it on head.
>>
>> While testing this, I got an observation, pruning is not scanning default
>> partition leading to wrong output. below is test to reproduce this.
>>
>> create table rp (a int, b varchar) partition by range (a);
>> create table rp_p1 partition of rp default;
>> create table rp_p2 partition of rp for values from (1) to (10);
>> create table rp_p3 partition of rp for values from (10) to (maxvalue);
>>
>> insert into rp values (-1,'p1');
>> insert into rp values (1,'p2');
>> insert into rp values (11,'p3');
>>
>> postgres=# select tableoid::regclass,* from rp;
>>  tableoid | a  | b
>> ----------+----+----
>>  rp_p2    |  1 | p2
>>  rp_p3    | 11 | p3
>>  rp_p1    | -1 | p1
>> (3 rows)
>>
>> --with pruning
>> postgres=# explain (costs off) select * from rp where a <= 1;
>>         QUERY PLAN
>> --------------------------
>>  Append
>>    ->  Seq Scan on rp_p2
>>          Filter: (a <= 1)
>> (3 rows)
>>
>> postgres=# select * from rp where a <= 1;
>>  a | b
>> ---+----
>>  1 | p2
>> (1 row)

Both this (wrong output)...

> I had noticed this and also that this crash:
> 
> tprt PARTITION BY RANGE(Col1)
>        tprt_1 FOR VALUES FROM (1) TO (50001) PARTITION BY RANGE(Col1)
>               tprt_11 FOR VALUES FROM (1) TO (10000),
>               tprt_1d DEFAULT
>        tprt_2 FOR VALUES FROM (50001) TO (100001)
> 
> EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 BETWEEN 20000 AND 70000;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> !>

...and this (crash) were due to bugs in the 0005 patch.

Output with the updated patch for Rajkumar's test:

explain (costs off ) select * from rp where a <= 1;
        QUERY PLAN
--------------------------
 Append
   ->  Seq Scan on rp_p2
         Filter: (a <= 1)
   ->  Seq Scan on rp_p1
         Filter: (a <= 1)
(5 rows)

select tableoid::regclass, * from rp where a <= 1;
 tableoid | a  | b
----------+----+----
 rp_p2    |  1 | p2
 rp_p1    | -1 | p1
(2 rows)

-- moreover
select tableoid::regclass, * from rp where a < 1;
 tableoid | a  | b
----------+----+----
 rp_d     | -1 | p1
(1 row)


Should be fixed in the attached updated version.  While fixing the bugs, I
made some significant revisions to the code introduced by 0005.

No significant changes to any of the patches 0001-0004.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

25 октября 2017 г., 13:37:05

On 2017/10/25 15:47, Amit Langote wrote:
> On 2017/10/24 1:38, Beena Emerson wrote:
>> I had noticed this and also that this crash:
>>
>> tprt PARTITION BY RANGE(Col1)
>>        tprt_1 FOR VALUES FROM (1) TO (50001) PARTITION BY RANGE(Col1)
>>               tprt_11 FOR VALUES FROM (1) TO (10000),
>>               tprt_1d DEFAULT
>>        tprt_2 FOR VALUES FROM (50001) TO (100001)
>>
>> EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 BETWEEN 20000 AND 70000;
>> server closed the connection unexpectedly
>> This probably means the server terminated abnormally
>> before or while processing the request.
>> The connection to the server was lost. Attempting reset: Failed.
>> !>
> 
> ...and this (crash) were due to bugs in the 0005 patch.

[ .... ]

> Should be fixed in the attached updated version.

Oops, not quite.  The crash that Beena reported wasn't fixed (or rather
reintroduced by some unrelated change after once confirming it was fixed).

Really fixed this time.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Beena,

Thanks for the tests.

On 2017/10/25 18:18, Beena Emerson wrote:
> The crashes are fixed. However, handling of DEFAULT partition in
> various queries is not proper.
> 
> Case 1: In this case default should be selected.
>  DROP TABLE tprt;
>   CREATE TABLE tprt (col1 int, col2 int) PARTITION BY range(col1);
>   CREATE TABLE tprt_1 PARTITION OF tprt FOR VALUES FROM (1) TO (50001)
> PARTITION BY list(col1);
>   CREATE TABLE tprt_11 PARTITION OF tprt_1 FOR VALUES IN (20000, 25000);
>   CREATE TABLE tprt_12 PARTITION OF tprt_1 FOR VALUES IN (50000, 35000);
>   CREATE TABLE tprt_13 PARTITION OF tprt_1 FOR VALUES IN (10000);
>   CREATE TABLE tprt_1d PARTITION OF tprt_1 DEFAULT;
> 
> 
> postgres=#  EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 < 10000;
>         QUERY PLAN
> --------------------------
>  Result
>    One-Time Filter: false
> (2 rows)

Hmm, this clearly looks wrong.  Fixed in the attached.

> Case 2: In this case DEFAULT need not be selected.
> 
> DROP TABLE  tprt;
>   CREATE TABLE tprt (col1 int, col2 int) PARTITION BY range(col1);
>   CREATE TABLE tprt_1 PARTITION OF tprt FOR VALUES FROM (1) TO (50001)
> PARTITION BY range(col1);
>   CREATE TABLE tprt_11 PARTITION OF tprt_1 FOR VALUES FROM (1) TO (10000);
>   CREATE TABLE tprt_12 PARTITION OF tprt_1 FOR VALUES FROM (10000) TO (20000);
>   CREATE TABLE tprt_13 PARTITION OF tprt_1 FOR VALUES FROM (20000) TO (30000);
>   CREATE TABLE tprt_1d PARTITION OF tprt_1 DEFAULT;
>   INSERT INTO tprt SELECT generate_series(1,50000), generate_series(1,50000);
> 
> postgres=#  EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 < 10000;
>            QUERY PLAN
> --------------------------------
>  Append
>    ->  Seq Scan on tprt_11
>          Filter: (col1 < 10000)
>    ->  Seq Scan on tprt_1d
>          Filter: (col1 < 10000)
> (5 rows)

Yeah, ideally.  But it's kind of hard to for the new partition-pruning
algorithm to be *that* correct in this particular case involving default
partitions.  Let me try to explain why I think it may be a bit hard to
implement.

I perhaps have mentioned before that the new partition-pruning algorithm
runs for every partitioned table in the tree separately.  In this example,
it will first determine for the root table tprt that only the partition
tprt_1 needs to be scanned.  Since tprt_1 is itself partitioned, algorithm
will run again, but the fact that tprt_1 (iow, any of its partitions) is
itself constrained to range [1, 50001) is, for the most part, lost on the
algorithm.  Note that non-default partitions (tprt_11, tprt_12, ...) have
bound datums in PartitionBoundInfo describing the range of data they
contain, which the algorithm uses to determine the set of partitions
satisfying given set of clauses.  The default partition has no datums.
The only thing describing what it contains is its partition constraint.
From the clause col1 < 10000, the algorithm will conclude that the default
partition might contain some data satisfying the same, because it knows
for sure that there no non-default partition for keys < 1.

It can perhaps taught to not make that conclusion by taking into account
the default partition's partition constraint, which includes constraint
inherited from the parent, viz. 1 <= col1 < 50001.  To do that, it might
be possible to summon up predtest.c's powers to conclude from the default
partition's partition constraint that it cannot contain any keys < 1, but
then we'll have to frame up a clause expression describing the latter.
Generating such a clause expression can be a bit daunting for a
multi-column key.   So, I haven't yet tried really hard to implement this.
 Any thoughts on that?

Meanwhile, attached updated set of patches including fixes for the typos
you reported in the other message.  Updated 0005 fixes the first bug (the
Case 1 in your email), while other patches 0002-0004 are updated mostly to
fix the reported typos.  A couple of tests are added in 0001 to test the
default partition case a bit more.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Thu, Oct 26, 2017 at 4:47 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Meanwhile, attached updated set of patches including fixes for the typos
you reported in the other message. Updated 0005 fixes the first bug (the
Case 1 in your email), while other patches 0002-0004 are updated mostly to
fix the reported typos. A couple of tests are added in 0001 to test the
default partition case a bit more.

Hi Amit,

while testing further this feature, I got a bug with partitions as foreign tables. Test case given below. Take a look.

CREATE EXTENSION postgres_fdw;
CREATE SERVER fp_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname 'postgres', port '5432', use_remote_estimate 'true');
CREATE USER MAPPING FOR PUBLIC SERVER fp_server;

CREATE TABLE fplt1 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE fplt1_p1 (a int, b int, c text);
CREATE FOREIGN TABLE ftplt1_p1 PARTITION OF fplt1 FOR VALUES IN ('0000', '0001', '0002', '0003') SERVER fp_server OPTIONS (TABLE_NAME 'fplt1_p1');
CREATE TABLE fplt1_p2 (a int, b int, c text);
CREATE FOREIGN TABLE ftplt1_p2 PARTITION OF fplt1 FOR VALUES IN ('0004', '0005', '0006', '0007') SERVER fp_server OPTIONS (TABLE_NAME 'fplt1_p2');

INSERT INTO fplt1_p1 SELECT i, i, to_char(i/50, 'FM0000') FROM generate_series(0, 198, 2) i;
INSERT INTO fplt1_p2 SELECT i, i, to_char(i/50, 'FM0000') FROM generate_series(200, 398, 2) i;

--PG-HEAD
postgres=# EXPLAIN (COSTS OFF) SELECT t1.c FROM fplt1 t1, LATERAL (SELECT DISTINCT t2.c FROM fplt1 t2 WHERE t2.c = t1.c ) q;
                    QUERY PLAN
--------------------------------------------------
Nested Loop
   -> Append
         -> Foreign Scan on ftplt1_p1 t1
         -> Foreign Scan on ftplt1_p2 t1_1
   -> Unique
         -> Append
               -> Foreign Scan on ftplt1_p1 t2
               -> Foreign Scan on ftplt1_p2 t2_1
(8 rows)

--PG-HEAD +v5 patches
postgres=# EXPLAIN (COSTS OFF) SELECT t1.c FROM fplt1 t1, LATERAL (SELECT DISTINCT t2.c FROM fplt1 t2 WHERE t2.c = t1.c ) q;
ERROR: invalid expression for partition key

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

27 октября 2017 г., 15:11:25

Thanks Rajkumar for the test case.

On 2017/10/27 17:05, Rajkumar Raghuwanshi wrote:
> while testing further this feature, I got a bug with partitions as foreign
> tables. Test case given below. Take a look.

[ ... ]

> 
> --PG-HEAD
> postgres=# EXPLAIN (COSTS OFF) SELECT t1.c FROM fplt1 t1, LATERAL (SELECT
> DISTINCT t2.c FROM fplt1 t2  WHERE  t2.c = t1.c ) q;
>                     QUERY PLAN
> --------------------------------------------------
>  Nested Loop
>    ->  Append
>          ->  Foreign Scan on ftplt1_p1 t1
>          ->  Foreign Scan on ftplt1_p2 t1_1
>    ->  Unique
>          ->  Append
>                ->  Foreign Scan on ftplt1_p1 t2
>                ->  Foreign Scan on ftplt1_p2 t2_1
> (8 rows)
> 
> --PG-HEAD +v5 patches
> postgres=# EXPLAIN (COSTS OFF) SELECT t1.c FROM fplt1 t1, LATERAL (SELECT
> DISTINCT t2.c FROM fplt1 t2  WHERE  t2.c = t1.c ) q;
> 
> *ERROR:  invalid expression for partition key*

I looked at this and it seems the error occurs not because partitions
being foreign tables, but because the new code is wrong to assume that
Param nodes can never appear in the clauses coming from baserestrictinfo.
When trying to do the plan-time pruning for the partitioned table
appearing inside the lateral subquery, there are Params in the clauses in
baserestrictinfo that the new pruning code was unprepared to handle.
Fixed the code to instead give up on plan-time pruning in such a case.

Attached updated set of patches.  In addition to fixing the above bug, it
also fixes one of the cases reported by Beena regarding default partition
pruning that I yesterday had given up on as being too difficult to
implement [1], but today found out is not that difficult to do [2].
Change summary:

0001: added some new tests
0002: no change
0003: fixed issue that Rajkumar reported (cope with Params properly)
0004: no change
0005: fix the case to prune the default partition when warranted (the
      issue reported by Beena)

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/0d6096e8-7c7b-afed-71d3-dca151306626%40lab.ntt.co.jp

[2]
https://www.postgresql.org/message-id/8499324c-8a33-4be7-9d23-7e6a95e60ddf%40lab.ntt.co.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] path toward faster partition pruning

От

Rajkumar Raghuwanshi

Дата:

27 октября 2017 г., 16:29:13

On Fri, Oct 27, 2017 at 2:41 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

0001: added some new tests
0002: no change
0003: fixed issue that Rajkumar reported (cope with Params properly)
0004: no change
0005: fix the case to prune the default partition when warranted (the
issue reported by Beena)

Thanks for the updated patch, i am getting server crash with below query.

CREATE TABLE mp (c1 int, c2 int, c3 int) PARTITION BY LIST(c3);
CREATE TABLE mp_p1 PARTITION OF mp FOR VALUES IN (10, 20) PARTITION BY RANGE(c2);
CREATE TABLE mp_p1_1 PARTITION OF mp_p1 FOR VALUES FROM (0) TO (200);
CREATE TABLE mp_p1_2 PARTITION OF mp_p1 FOR VALUES FROM (200) TO (400);
CREATE TABLE mp_p2 PARTITION OF mp FOR VALUES IN (30, 40) PARTITION BY RANGE(c2);
CREATE TABLE mp_p2_1 PARTITION OF mp_p2 FOR VALUES FROM (0) TO (300);
CREATE TABLE mp_p2_2 PARTITION OF mp_p2 FOR VALUES FROM (300) TO (600);

INSERT INTO mp VALUES(10, 100, 10);
INSERT INTO mp VALUES(20, 200, 20);
INSERT INTO mp VALUES(21, 150, 30);
INSERT INTO mp VALUES(30, 200, 40);
INSERT INTO mp VALUES(31, 300, 30);
INSERT INTO mp VALUES(40, 400, 40);

EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM mp WHERE c3 = 40 AND c2 < 300;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

30 октября 2017 г., 11:55:48

Thanks Rajkumar.

On 2017/10/27 19:29, Rajkumar Raghuwanshi wrote:
> On Fri, Oct 27, 2017 at 2:41 PM, Amit Langote wrote:
>> 0001: added some new tests
>> 0002: no change
>> 0003: fixed issue that Rajkumar reported (cope with Params properly)
>> 0004: no change
>> 0005: fix the case to prune the default partition when warranted (the
>>       issue reported by Beena)
>>
> 
> Thanks for the updated patch, i am getting server crash with below query.
> 
> CREATE TABLE mp (c1 int, c2 int, c3 int) PARTITION BY LIST(c3);
> CREATE TABLE mp_p1 PARTITION OF mp FOR VALUES IN (10, 20) PARTITION BY
> RANGE(c2);
> CREATE TABLE mp_p1_1 PARTITION OF mp_p1 FOR VALUES FROM (0) TO (200);
> CREATE TABLE mp_p1_2 PARTITION OF mp_p1 FOR VALUES FROM (200) TO (400);
> CREATE TABLE mp_p2 PARTITION OF mp FOR VALUES IN (30, 40) PARTITION BY
> RANGE(c2);
> CREATE TABLE mp_p2_1 PARTITION OF mp_p2 FOR VALUES FROM (0) TO (300);
> CREATE TABLE mp_p2_2 PARTITION OF mp_p2 FOR VALUES FROM (300) TO (600);
> 
> INSERT INTO mp VALUES(10, 100, 10);
> INSERT INTO mp VALUES(20, 200, 20);
> INSERT INTO mp VALUES(21, 150, 30);
> INSERT INTO mp VALUES(30, 200, 40);
> INSERT INTO mp VALUES(31, 300, 30);
> INSERT INTO mp VALUES(40, 400, 40);
> 
> EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM mp WHERE c3 = 40 AND
> c2 < 300;
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

Looks like an update I'd included in the last posted patch (viz. add the
non-root partitioned tables' partition constraint clauses to the list of
clauses used for pruning) exposed a bug in how ScalarArrayOpExpr clauses
are being handled by the new pruning code.  A partitioned list partition's
internal partition constraint clause contains ArrayExpr as the
ScalarArrayOpExpr's right-hand operand, whereas the pruning code thought
there could only ever be a Const holding an ArrayType value.

Fixed in the attached updated patch, along with a new test in 0001 to
cover this case.  Also, made a few tweaks to 0003 and 0005 (moved some
code from the former to the latter) around the handling of ScalarArrayOpExprs.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

30 октября 2017 г., 12:50:25

On 2017/10/30 14:55, Amit Langote wrote:
> Fixed in the attached updated patch, along with a new test in 0001 to
> cover this case.  Also, made a few tweaks to 0003 and 0005 (moved some
> code from the former to the latter) around the handling of ScalarArrayOpExprs.

Sorry, I'd forgotten to include some changes.

In the previous versions, RT index of the table needed to be passed to
partition.c, which I realized is no longer needed, so I removed that
requirement from the interface.  As a result, patches 0002 and 0003 have
changed in this version.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] path toward faster partition pruning

От

Rajkumar Raghuwanshi

Дата:

30 октября 2017 г., 14:09:13

On Mon, Oct 30, 2017 at 12:20 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

In the previous versions, RT index of the table needed to be passed to
partition.c, which I realized is no longer needed, so I removed that
requirement from the interface. As a result, patches 0002 and 0003 have
changed in this version.

Thanks for the fix.

I am getting wrong output when default is sub-partitioned further, below is a test case.

CREATE TABLE lpd(a int, b varchar, c float) PARTITION BY LIST (a);
CREATE TABLE lpd_p1 PARTITION OF lpd FOR VALUES IN (1,2,3);
CREATE TABLE lpd_p2 PARTITION OF lpd FOR VALUES IN (4,5);
CREATE TABLE lpd_d PARTITION OF lpd DEFAULT PARTITION BY LIST(a);
CREATE TABLE lpd_d1 PARTITION OF lpd_d FOR VALUES IN (7,8,9);
CREATE TABLE lpd_d2 PARTITION OF lpd_d FOR VALUES IN (10,11,12);
CREATE TABLE lpd_d3 PARTITION OF lpd_d FOR VALUES IN (6,null);
INSERT INTO lpd SELECT i,i,i FROM generate_Series (1,12)i;
INSERT INTO lpd VALUES (null,null,null);

--on HEAD
postgres=# EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER BY 1;
                 QUERY PLAN
---------------------------------------------
Sort
   Sort Key: ((lpd_p1.tableoid)::regclass)
   -> Result
         -> Append
               -> Seq Scan on lpd_p1
                     Filter: (a IS NOT NULL)
               -> Seq Scan on lpd_p2
                     Filter: (a IS NOT NULL)
               -> Seq Scan on lpd_d3
                     Filter: (a IS NOT NULL)
               -> Seq Scan on lpd_d1
                     Filter: (a IS NOT NULL)
               -> Seq Scan on lpd_d2
                     Filter: (a IS NOT NULL)
(14 rows)

postgres=#
postgres=# SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER BY 1;
tableoid | a | b | c
----------+----+----+----
lpd_p1   | 1 | 1 | 1
lpd_p1   | 2 | 2 | 2
lpd_p1   | 3 | 3 | 3
lpd_p2   | 4 | 4 | 4
lpd_p2   | 5 | 5 | 5
lpd_d1   | 7 | 7 | 7
lpd_d1   | 8 | 8 | 8
lpd_d1   | 9 | 9 | 9
lpd_d2   | 12 | 12 | 12
lpd_d2   | 10 | 10 | 10
lpd_d2   | 11 | 11 | 11
lpd_d3   | 6 | 6 | 6
(12 rows)

--on HEAD + v8 patches

postgres=# EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER BY 1;
                 QUERY PLAN
---------------------------------------------
Sort
   Sort Key: ((lpd_p1.tableoid)::regclass)
   -> Result
         -> Append
               -> Seq Scan on lpd_p1
                     Filter: (a IS NOT NULL)
               -> Seq Scan on lpd_p2
                     Filter: (a IS NOT NULL)
(8 rows)

postgres=# SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER BY 1;
tableoid | a | b | c
----------+---+---+---
lpd_p1   | 1 | 1 | 1
lpd_p1   | 2 | 2 | 2
lpd_p1   | 3 | 3 | 3
lpd_p2   | 4 | 4 | 4
lpd_p2   | 5 | 5 | 5
(5 rows)

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Re: [HACKERS] path toward faster partition pruning

От

Dilip Kumar

Дата:

30 октября 2017 г., 22:50:01

On Mon, Oct 30, 2017 at 12:20 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/10/30 14:55, Amit Langote wrote:
>> Fixed in the attached updated patch, along with a new test in 0001 to
>> cover this case.  Also, made a few tweaks to 0003 and 0005 (moved some
>> code from the former to the latter) around the handling of ScalarArrayOpExprs.
>
> Sorry, I'd forgotten to include some changes.
>
> In the previous versions, RT index of the table needed to be passed to
> partition.c, which I realized is no longer needed, so I removed that
> requirement from the interface.  As a result, patches 0002 and 0003 have
> changed in this version.

Some Minor comments:

+ * get_rel_partitions
+ * Return the list of partitions of rel that pass the clauses mentioned
+ * rel->baserestrictinfo
+ *
+ * Returned list contains the AppendRelInfos of chosen partitions.
+ */
+static List *
+get_append_rel_partitions(PlannerInfo *root,

Function name in function header is not correct.

+ !DatumGetBool(((Const *) clause)->constvalue))
+ {
+ *constfalse = true;
+ continue;

DatumGetBool will return true if the first byte of constvalue is
nonzero otherwise
false.  IIUC, this is not the intention here. Or I am missing something?

+ * clauses in this function ourselves, for example, having both a > 1 and
+ * a = 0 the list

a = 0 the list -> a = 0 in the list

+static bool
+partkey_datum_from_expr(const Expr *expr, Datum *value)
+{
+ /*
+ * Add more expression types here as needed to support higher-level
+ * code.
+ */
+ switch (nodeTag(expr))
+ {
+ case T_Const:
+ *value = ((Const *) expr)->constvalue;
+ return true;

I think for evaluating other expressions (e.g. T_Param) we will have
to pass ExprContext to this function. Or we can do something cleaner
because if we want to access the ExprContext inside
partkey_datum_from_expr then we may need to pass it to
"get_partitions_from_clauses" which is a common function for optimizer
and executor.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

31 октября 2017 г., 13:53:30

Thanks for the test case.

On 2017/10/30 17:09, Rajkumar Raghuwanshi wrote:
> I am getting wrong output when default is sub-partitioned further, below is
> a test case.
> 
> CREATE TABLE lpd(a int, b varchar, c float) PARTITION BY LIST (a);
> CREATE TABLE lpd_p1 PARTITION OF lpd FOR VALUES IN (1,2,3);
> CREATE TABLE lpd_p2 PARTITION OF lpd FOR VALUES IN (4,5);
> CREATE TABLE lpd_d PARTITION OF lpd DEFAULT PARTITION BY LIST(a);
> CREATE TABLE lpd_d1 PARTITION OF lpd_d FOR VALUES IN (7,8,9);
> CREATE TABLE lpd_d2 PARTITION OF lpd_d FOR VALUES IN (10,11,12);
> CREATE TABLE lpd_d3 PARTITION OF lpd_d FOR VALUES IN (6,null);
> INSERT INTO lpd SELECT i,i,i FROM generate_Series (1,12)i;
> INSERT INTO lpd VALUES (null,null,null);
> 
> --on HEAD
> postgres=# EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM lpd WHERE
> a IS NOT NULL ORDER BY 1;
>                  QUERY PLAN
> ---------------------------------------------
>  Sort
>    Sort Key: ((lpd_p1.tableoid)::regclass)
>    ->  Result
>          ->  Append
>                ->  Seq Scan on lpd_p1
>                      Filter: (a IS NOT NULL)
>                ->  Seq Scan on lpd_p2
>                      Filter: (a IS NOT NULL)
>                ->  Seq Scan on lpd_d3
>                      Filter: (a IS NOT NULL)
>                ->  Seq Scan on lpd_d1
>                      Filter: (a IS NOT NULL)
>                ->  Seq Scan on lpd_d2
>                      Filter: (a IS NOT NULL)
> (14 rows)
> 
> postgres=#
> postgres=# SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER
> BY 1;
>  tableoid | a  | b  | c
> ----------+----+----+----
>  lpd_p1   |  1 | 1  |  1
>  lpd_p1   |  2 | 2  |  2
>  lpd_p1   |  3 | 3  |  3
>  lpd_p2   |  4 | 4  |  4
>  lpd_p2   |  5 | 5  |  5
>  lpd_d1   |  7 | 7  |  7
>  lpd_d1   |  8 | 8  |  8
>  lpd_d1   |  9 | 9  |  9
>  lpd_d2   | 12 | 12 | 12
>  lpd_d2   | 10 | 10 | 10
>  lpd_d2   | 11 | 11 | 11
>  lpd_d3   |  6 | 6  |  6
> (12 rows)
> 
> 
> --on HEAD + v8 patches
> 
> postgres=# EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM lpd WHERE
> a IS NOT NULL ORDER BY 1;
>                  QUERY PLAN
> ---------------------------------------------
>  Sort
>    Sort Key: ((lpd_p1.tableoid)::regclass)
>    ->  Result
>          ->  Append
>                ->  Seq Scan on lpd_p1
>                      Filter: (a IS NOT NULL)
>                ->  Seq Scan on lpd_p2
>                      Filter: (a IS NOT NULL)
> (8 rows)
> 
> postgres=# SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER
> BY 1;
>  tableoid | a | b | c
> ----------+---+---+---
>  lpd_p1   | 1 | 1 | 1
>  lpd_p1   | 2 | 2 | 2
>  lpd_p1   | 3 | 3 | 3
>  lpd_p2   | 4 | 4 | 4
>  lpd_p2   | 5 | 5 | 5
> (5 rows)

I found bugs in 0003 and 0005 that caused this.  Will post the patches
containing the fix in reply to the Dilip's email which contains some code
review comments [1].

Also, I noticed that the new pruning code was having a hard time do deal
with the fact that the default "range" partition doesn't explicitly say in
its partition constraint that it might contain null values.  More
precisely perhaps, the default range partition's constraint appears to
imply that it can only contain non-null values, which confuses the new
pruning code.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFiTN-thYsobXxPS6bwOA_9erpax_S=iztSn3RtUxKKMKG4V4A@mail.gmail.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

31 октября 2017 г., 14:43:51

Thanks Dilip for reviewing.

On 2017/10/31 1:50, Dilip Kumar wrote:
> On Mon, Oct 30, 2017 at 12:20 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/10/30 14:55, Amit Langote wrote:
>>> Fixed in the attached updated patch, along with a new test in 0001 to
>>> cover this case.  Also, made a few tweaks to 0003 and 0005 (moved some
>>> code from the former to the latter) around the handling of ScalarArrayOpExprs.
>>
>> Sorry, I'd forgotten to include some changes.
>>
>> In the previous versions, RT index of the table needed to be passed to
>> partition.c, which I realized is no longer needed, so I removed that
>> requirement from the interface.  As a result, patches 0002 and 0003 have
>> changed in this version.
> 
> Some Minor comments:
> 
> + * get_rel_partitions
> + * Return the list of partitions of rel that pass the clauses mentioned
> + * rel->baserestrictinfo
> + *
> + * Returned list contains the AppendRelInfos of chosen partitions.
> + */
> +static List *
> +get_append_rel_partitions(PlannerInfo *root,
> 
> Function name in function header is not correct.

Fixed.

> + !DatumGetBool(((Const *) clause)->constvalue))
> + {
> + *constfalse = true;
> + continue;
> 
> DatumGetBool will return true if the first byte of constvalue is
> nonzero otherwise
> false.  IIUC, this is not the intention here. Or I am missing something?

This coding pattern is in use in quite a few places; see for example in
restriction_is_constant_false() and many others like
relation_excluded_by_constraints(), negate_clause(), etc.

If a RestrictInfo is marked pseudoconstant=true, then the clause therein
must be a Const with constvalue computing to 0 if the clause is false, so
that DatumGetBool(constvalue) returns boolean false and non-zero otherwise.

> + * clauses in this function ourselves, for example, having both a > 1 and
> + * a = 0 the list
> 
> a = 0 the list -> a = 0 in the list

Right, fixed.

> 
> +static bool
> +partkey_datum_from_expr(const Expr *expr, Datum *value)
> +{
> + /*
> + * Add more expression types here as needed to support higher-level
> + * code.
> + */
> + switch (nodeTag(expr))
> + {
> + case T_Const:
> + *value = ((Const *) expr)->constvalue;
> + return true;
> 
> I think for evaluating other expressions (e.g. T_Param) we will have
> to pass ExprContext to this function.

That's right.

> Or we can do something cleaner
> because if we want to access the ExprContext inside
> partkey_datum_from_expr then we may need to pass it to
> "get_partitions_from_clauses" which is a common function for optimizer
> and executor.

Yeah, I've thought about that a little.  Since nothing else but the
planner calls it now and the planner doesn't itself have its hands on the
ExprContext that would be necessary for computing something like Params, I
left it out of the interface for now.  That said, I *am* actually thinking
about some interface changes that would be necessary for some other
unrelated functionality/optimizations that callers like the run-time
pruning code would expect of get_partitions_from_clauses().  We can design
the interface extension such that the aforementioned ExprContext is passed
together.

Attached updated version of the patches addressing some of your comments
above and fixing a bug that Rajkumar reported [1].  As mentioned there,
I'm including here a patch (the 0005 of the attached) to tweak the default
range partition constraint to be explicit about null values that it might
contain.  So, there are 6 patches now and what used to be patch 0005 in
the previous set is patch 0006 in this version of the set.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/cd5a2d2e-0957-042c-40c2-06033fe0abf2@lab.ntt.co.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On 3 November 2017 at 17:32, David Rowley <david.rowley@2ndquadrant.com> wrote:
> 2. This code is way more complex than it needs to be.
>
> if (num_parts > 0)
> {
> int j;
>
> all_indexes = (int *) palloc(num_parts * sizeof(int));
> j = 0;
> if (min_part_idx >= 0 && max_part_idx >= 0)
> {
> for (i = min_part_idx; i <= max_part_idx; i++)
> all_indexes[j++] = i;
> }
> if (!bms_is_empty(other_parts))
> while ((i = bms_first_member(other_parts)) >= 0)
> all_indexes[j++] = i;
> if (j > 1)
> qsort((void *) all_indexes, j, sizeof(int), intcmp);
> }
>
> It looks like the min/max partition stuff is just complicating things
> here. If you need to build this array of all_indexes[] anyway, I don't
> quite understand the point of the min/max. It seems like min/max would
> probably work a bit nicer if you didn't need the other_parts
> BitmapSet, so I recommend just getting rid of min/max completely and
> just have a BitmapSet with bit set for each partition's index you
> need, you'd not need to go to the trouble of performing a qsort on an
> array and you could get rid of quite a chunk of code too.
>
> The entire function would then not be much more complex than:
>
> partindexes = get_partitions_from_clauses(parent, partclauses);
>
> while ((i = bms_first_member(partindexes)) >= 0)
> {
>     AppendRelInfo *appinfo = rel->part_appinfos[i];
>     result = lappend(result, appinfo);
> }
>
> Then you can also get rid of your intcmp() function too.

I've read a bit more of the patch and I think even more now that the
min/max stuff should be removed.

I understand that you'll be bsearching for a lower and upper bound for
cases like:

SELECT * FROM pt WHERE key BETWEEN 10 and 20;

but it looks like the min and max range stuff is thrown away if the
query is written as:

SELECT * FROM pt WHERE key BETWEEN 10 and 20 OR key BETWEEN 30 AND 40;

from reading the code, it seems like partset_union() would be called
in this case and if the min/max of each were consecutive then the
min/max range would get merged, but there's really a lot of code to
support this. I think it would be much better to invent
bms_add_range() and just use a Bitmapset to store the partition
indexes to scan. You could simply use bms_union for OR cases and
bms_intersect() or AND cases. It seems this would allow removal of
this complex code. It looks like this would allow you to remove all
the partset_range_* macros too.

I've attached a patch which implements bms_add_range() which will save
you from having to write the tight loops to call bms_add_member() such
as the ones in partset_union(). Those would not be so great if there
was a huge number of partitions as the Bitmapset->words[] array could
be expanded many more times than required. bms_add_range() will handle
that much more efficiently with a maximum of 1 repalloc() for the
whole operation. It would also likely faster since it's working at the
bitmapword level rather than bit level.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

0001-Add-bms_add_range-to-add-members-within-the-specifie.patch

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

06 ноября 2017 г., 10:15:30

On 31 October 2017 at 21:43, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated version of the patches

match_clauses_to_partkey() needs to allow for the way quals on Bool
columns are represented.

create table pt (a bool not null) partition by list (a);
create table pt_true partition of pt for values in('t');
create table pt_false partition of pt for values in('f');
explain select * from pt where a = true;                           QUERY PLAN
------------------------------------------------------------------Append  (cost=0.00..76.20 rows=2810 width=1)  ->  Seq
Scanon pt_false  (cost=0.00..38.10 rows=1405 width=1)        Filter: a  ->  Seq Scan on pt_true  (cost=0.00..38.10
rows=1405width=1)        Filter: a
 
(5 rows)

match_clause_to_indexcol() shows an example of how to handle this.

explain select * from pt where a = false;

will need to be allowed too. This works slightly differently.

-- David Rowley                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Sender Address Forgery]Re: [HACKERS] path toward fasterpartition pruning

От

Amit Langote

Дата:

06 ноября 2017 г., 10:30:02

Hi David,

On 2017/11/03 13:32, David Rowley wrote:
> On 31 October 2017 at 21:43, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached updated version of the patches addressing some of your comments
> 
> I've spent a bit of time looking at these but I'm out of time for now.

Thanks a lot for the review and sorry for the delay in sending rebased
patches.

> So far I have noted down the following;
> 
> 1. This comment seem wrong.
> 
> /*
> * Since the clauses in rel->baserestrictinfo should all contain Const
> * operands, it should be possible to prune partitions right away.
> */

Yes.  I used to think it was true, then realized it isn't and updated the
code to get rid of that assumption, but I forgot updating this comment.
Fixed.

> How about PARTITION BY RANGE (a) and SELECT * FROM parttable WHERE a > b; ?
> baserestrictinfo in this case will contain a single RestrictInfo with
> an OpExpr containing two Var args and it'll come right through that
> function too.

As it should I think.  Quite similarly, you will be able see that index
path won't be considered for such a clause:

create table foo (a int, b int);
create index fooi on foo (a);
insert into foo select generate_series(1, 100000);

explain select * from foo where a = 1;
                              QUERY PLAN
----------------------------------------------------------------------
 Bitmap Heap Scan on foo  (cost=12.17..482.50 rows=500 width=8)
   Recheck Cond: (a = 1)
   ->  Bitmap Index Scan on fooi  (cost=0.00..12.04 rows=500 width=0)
         Index Cond: (a = 1)
(4 rows)

explain select * from foo where a <= b;
                       QUERY PLAN
--------------------------------------------------------
 Seq Scan on foo  (cost=0.00..1693.00 rows=500 width=8)
   Filter: (a = b)
(2 rows)

We won't be able to use such a clause for pruning at all; neither
planning-time pruning nor execution-time pruning.  Am I missing something?

> 2. This code is way more complex than it needs to be.
> 
> if (num_parts > 0)
> {
> int j;
> 
> all_indexes = (int *) palloc(num_parts * sizeof(int));
> j = 0;
> if (min_part_idx >= 0 && max_part_idx >= 0)
> {
> for (i = min_part_idx; i <= max_part_idx; i++)
> all_indexes[j++] = i;
> }
> if (!bms_is_empty(other_parts))
> while ((i = bms_first_member(other_parts)) >= 0)
> all_indexes[j++] = i;
> if (j > 1)
> qsort((void *) all_indexes, j, sizeof(int), intcmp);
> }
> 
> It looks like the min/max partition stuff is just complicating things
> here. If you need to build this array of all_indexes[] anyway, I don't
> quite understand the point of the min/max. It seems like min/max would
> probably work a bit nicer if you didn't need the other_parts
> BitmapSet, so I recommend just getting rid of min/max completely and
> just have a BitmapSet with bit set for each partition's index you
> need, you'd not need to go to the trouble of performing a qsort on an
> array and you could get rid of quite a chunk of code too.
>
> The entire function would then not be much more complex than:
>
> partindexes = get_partitions_from_clauses(parent, partclauses);
>
> while ((i = bms_first_member(partindexes)) >= 0)
> {
>     AppendRelInfo *appinfo = rel->part_appinfos[i];
>     result = lappend(result, appinfo);
> }
>
> Then you can also get rid of your intcmp() function too.

The design with min/max partition index interface to the partition.c's new
partition-pruning facility is intentional.  You can find hints about how
such a design came about in the following Robert's email:

https://www.postgresql.org/message-id/CA%2BTgmoYcv_MghvhV8pL33D95G8KVLdZOxFGX5dNASVkXO8QuPw%40mail.gmail.com

For range queries, it is desirable for the partitioning module to return
the set of qualifying partitions that are contiguous in a compact (O(1))
min/max representation than having to enumerate all those indexes in the
set.  It's nice to avoid iterating over that set twice -- once when
constructing the set in the partitioning module and then again in the
caller (in this case, planner) to perform the planning-related tasks per
selected partition.

We need the other_parts Bitmapset too, because selected partitions may not
always be contiguous, even in the case of range partitioning, if there are
OR clauses and the possibility that the default partition is also
selected.  While computing the selected partition set from a given set of
clauses, partitioning code tries to keep the min/max representation as
long as it makes sense to and once the selected partitions no longer
appear to be contiguous it switches to the Bitmapset mode.

> 3. Following code has the wrong size calculation:
> 
> memset(keynullness, -1, PARTITION_MAX_KEYS * sizeof(NullTestType *));
> 
> should be PARTITION_MAX_KEYS * sizeof(NullTestType). It might have
> worked on your machine if you're compiling as 32 bit.

Oops, you're right.  Fixed.

> I'll continue on with the review in the next few days.

Thanks again.

Attached is the updated set of patches.

Regards,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On 2017/11/06 14:32, David Rowley wrote:
> On 6 November 2017 at 17:30, Amit Langote wrote:
>> On 2017/11/03 13:32, David Rowley wrote:
>>> On 31 October 2017 at 21:43, Amit Langote wrote:
>>> 1. This comment seem wrong.
>>>
>>> /*
>>> * Since the clauses in rel->baserestrictinfo should all contain Const
>>> * operands, it should be possible to prune partitions right away.
>>> */
>>
>> Yes.  I used to think it was true, then realized it isn't and updated the
>> code to get rid of that assumption, but I forgot updating this comment.
>> Fixed.
>>
>>> How about PARTITION BY RANGE (a) and SELECT * FROM parttable WHERE a > b; ?
>>> baserestrictinfo in this case will contain a single RestrictInfo with
>>> an OpExpr containing two Var args and it'll come right through that
>>> function too.
> 
> ...
> 
>> We won't be able to use such a clause for pruning at all; neither
>> planning-time pruning nor execution-time pruning.  Am I missing something?
> 
> I just meant the comment was wrong.

Ah, gotcha.

>> The design with min/max partition index interface to the partition.c's new
>> partition-pruning facility is intentional.  You can find hints about how
>> such a design came about in the following Robert's email:
>>
>> https://www.postgresql.org/message-id/CA%2BTgmoYcv_MghvhV8pL33D95G8KVLdZOxFGX5dNASVkXO8QuPw%40mail.gmail.com
> 
> Yeah, I remember reading that before I had looked at the code. I
> disagree with Robert on this. The fact that the min/max range gets
> turned into a list of everything in that range in
> get_append_rel_partitions means all the advantages that storing the
> partitions as a range is voided. If you could have kept it a range the
> entire time, then that might be different, but seems you need to
> materialize the entire range in order to sort the partitions into
> order. I've included Robert in just in case he wants to take a look at
> the code that resulted from that design. Maybe something is not
> following what he had in mind, or maybe he'll change his mind based on
> the code that resulted.
> 
>> For range queries, it is desirable for the partitioning module to return
>> the set of qualifying partitions that are contiguous in a compact (O(1))
>> min/max representation than having to enumerate all those indexes in the
>> set.  It's nice to avoid iterating over that set twice -- once when
>> constructing the set in the partitioning module and then again in the
>> caller (in this case, planner) to perform the planning-related tasks per
>> selected partition.
> 
> The idea is that you still get the min and max from the bsearch, but
> then use bms_add_range() to populate a bitmapset of the matching
> partitions. The big-O notation of the search shouldn't change.
> 
>> We need the other_parts Bitmapset too, because selected partitions may not
>> always be contiguous, even in the case of range partitioning, if there are
>> OR clauses and the possibility that the default partition is also
>> selected.  While computing the selected partition set from a given set of
>> clauses, partitioning code tries to keep the min/max representation as
>> long as it makes sense to and once the selected partitions no longer
>> appear to be contiguous it switches to the Bitmapset mode.
> 
> Yip. I understand that. I just think there's no benefit to having
> min/max since it needs to be materialized into a list of the entire
> range at some point, it might as well be done as soon as possible
> using a bitmapset, which would save having all the partset_union,
> partset_intersect, partset_range_empty, partset_range_overlap,
> partset_range_adjacent code. You'd end up just using bms_union and
> bms_intersect then bms_add_range to handle the min/max bound you get
> from the bsearch.

OK, I have gotten rid of the min/max partition index interface and instead
adopted the bms_add_range() approach by including your patch to add the
same in the patch set (which is now 0002 in the whole set).  I have to
admit that it's simpler to understand the new code with just Bitmapsets to
look at, but I'm still a bit concerned about materializing the whole set
right within partition.c, although we can perhaps optimize it later.

Attached updated set of patches, including the fix to make the new pruning
code handle Boolean partitioning.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

06 ноября 2017 г., 18:52:51

On 6 November 2017 at 23:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> OK, I have gotten rid of the min/max partition index interface and instead
> adopted the bms_add_range() approach by including your patch to add the
> same in the patch set (which is now 0002 in the whole set).  I have to
> admit that it's simpler to understand the new code with just Bitmapsets to
> look at, but I'm still a bit concerned about materializing the whole set
> right within partition.c, although we can perhaps optimize it later.

Thanks for making that change. The code looks much more simple now.

For performance, if you're worried about a very large number of
partitions, then I think you're better off using bms_next_member()
rather than bms_first_member(), (likely this applies globally, but you
don't need to worry about those).

The problem with bms_first_member is that it must always loop over the
0 words before it finds any bits set for each call, whereas
bms_next_member will start on the word it was last called for. There
will likely be a pretty big performance difference between the two
when processing a large Bitmapset.

> Attached updated set of patches, including the fix to make the new pruning
> code handle Boolean partitioning.

Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13)

-- David Rowley                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

07 ноября 2017 г., 08:14:51

On 7 November 2017 at 01:52, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13)

I have a little more review to share:

1. Missing "in" in comment. Should be "mentioned in"

 * get_append_rel_partitions
 * Return the list of partitions of rel that pass the clauses mentioned
 * rel->baserestrictinfo

2. Variable should be declared in inner scope with the following fragment:

void
set_basic_child_rel_properties(PlannerInfo *root,
   RelOptInfo *rel,
   RelOptInfo *childrel,
   AppendRelInfo *appinfo)
{
AttrNumber attno;

if (rel->part_scheme)
{

which makes the code the same as where you moved it from.

3. Normally lfirst() is assigned to a variable at the start of a
foreach() loop. You have code which does not follow this.

foreach(lc, clauses)
{
Expr   *clause;
int i;

if (IsA(lfirst(lc), RestrictInfo))
{
RestrictInfo *rinfo = lfirst(lc);

You could assign this to a Node * since the type is unknown to you at
the start of the loop.

4.
/*
* Useless if what we're thinking of as a constant is actually
* a Var coming from this relation.
*/
if (bms_is_member(rel->relid, constrelids))
continue;

should this be moved to just above the op_strict() test? This one seems cheaper.

5. Typo "paritions": /* No clauses to prune paritions, so scan all
partitions. */

But thinking about it more the comment should something more along the
lines of /* No useful clauses for partition pruning. Scan all
partitions. */

The key difference is that there might be clauses, just without Consts.

Actually, the more I look at get_append_rel_partitions() I think it
would be better if you re-shaped that if/else if test so that it only
performs the loop over the partindexes if it's been set.

I ended up with the attached version of the function after moving
things around a little bit.

I'm still reviewing but thought I'd share this part so far.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

get_append_rel_partitions.c

Re: [Sender Address Forgery]Re: [HACKERS] path toward fasterpartition pruning

От

Amit Langote

Дата:

07 ноября 2017 г., 08:45:11

On 2017/11/06 21:52, David Rowley wrote:
> On 6 November 2017 at 23:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> OK, I have gotten rid of the min/max partition index interface and instead
>> adopted the bms_add_range() approach by including your patch to add the
>> same in the patch set (which is now 0002 in the whole set).  I have to
>> admit that it's simpler to understand the new code with just Bitmapsets to
>> look at, but I'm still a bit concerned about materializing the whole set
>> right within partition.c, although we can perhaps optimize it later.
> 
> Thanks for making that change. The code looks much more simple now.
> 
> For performance, if you're worried about a very large number of
> partitions, then I think you're better off using bms_next_member()
> rather than bms_first_member(), (likely this applies globally, but you
> don't need to worry about those).
> 
> The problem with bms_first_member is that it must always loop over the
> 0 words before it finds any bits set for each call, whereas
> bms_next_member will start on the word it was last called for. There
> will likely be a pretty big performance difference between the two
> when processing a large Bitmapset.

Ah, thanks for the explanation.  I will change it to bms_next_member() in
the next version.

>> Attached updated set of patches, including the fix to make the new pruning
>> code handle Boolean partitioning.
> 
> Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13)

Thank you.

Regards,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

08 ноября 2017 г., 10:44:59

On 7 November 2017 at 01:52, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13)

Hi Amit,

I had another look over this today. Apologies if any of the review seems petty.

Here goes:

1. If test seems to be testing for a child that's a partitioned table,
but is testing for a non-NULL part_scheme.

/*
* If childrel is itself partitioned, add it and its partitioned
* children to the list being propagated up to the root rel.
*/
if (childrel->part_scheme && rel->part_scheme)

Should this code use IS_PARTITIONED_REL() instead? Seems a bit strange
to test for a NULL part_scheme

2. There's a couple of mistakes in my bms_add_range() code. I've
attached bms_add_range_fix.patch. Can you apply this to your tree?

3. This assert seems to be Asserting the same thing twice:

Assert(rel->live_partitioned_rels != NIL &&
   list_length(rel->live_partitioned_rels) > 0);

A List with length == 0 is always NIL.

4. get_partitions_from_clauses(), can you comment why you perform the
list_concat() there.

I believe this is there so that the partition bound from the parent is
passed down to the child so that we can properly eliminate all child
partitions when the 2nd level of partitioning is using the same
partition key as the 1st level. I think this deserves a paragraph of
comment to explain this.

5. Please add a comment to explain what's going on here in
classify_partition_bounding_keys()

if (partattno == 0)
{
partexpr = lfirst(partexprs_item);
partexprs_item = lnext(partexprs_item);
}

Looks like, similar to index expressions, that partition expressions
are attno 0 to mark to consume the next expression from the list.

Does this need validation that there are enough partexprs_item items
like what is done in get_range_key_properties()? Or is this validated
somewhere else?

6. Comment claims the if test will test something which it does not
seem to test for:

/*
* Redundant key elimination using btree-semantics based tricks.
*
* Only list and range partitioning use btree operator semantics, so
* skip otherwise.   Also, if there are expressions whose value is yet
* unknown, skip this step, because we need to compare actual values
* below.
*/
memset(keyclauses, 0, PARTITION_MAX_KEYS * sizeof(List *));
if (partkey->strategy == PARTITION_STRATEGY_LIST ||
partkey->strategy == PARTITION_STRATEGY_RANGE)

I was expecting this to be skipped when the clauses contained a
non-const, but it does not seem to.

7. Should be "compare them"

/*
* If the leftarg and rightarg clauses' constants are both of the type
* expected by "op" clause's operator, then compare then using the
* latter's comparison function.
*/

But if I look at the code "compare then using the latter's comparison
function." is not true, it seems to use op's comparison function not
rightarg's. With most of the calls op and rightarg are the same, but
not all of them. The function shouldn't make that assumption even if
the args op was always the same as rightarg.

8. remove_redundant_clauses() needs an overview comment of what the
function does.

9. The comment should explain what we would do in the case of key < 3
AND key <= 2 using some examples.

/* try to keep only one of <, <= */

10. Wondering why this loop runs backward?

for (s = BTMaxStrategyNumber; --s >= 0;)

Why not just:

for (s = 0; s < BTMaxStrategyNumber; s++)

I can't see a special reason for it to run backward. It seems unusual,
but if there's a good reason that I've failed to realise then it's
maybe worth a comment.

11. Pleae comment on why *constfalse = true is set here:

if (!chk || s == (BTEqualStrategyNumber - 1))
continue;

if (partition_cmp_args(partopfamily, partopcintype, chk, eq, chk,
   &test_result))
{
if (!test_result)
{
*constfalse = true;
return;
}
/* discard the redundant key. */
xform[s] = NULL;
}

Looks like we'd hit this in a case such as: WHERE key = 1 AND key > 1.

Also please add a comment when discarding the redundant key maybe
explain that equality is more useful than the other strategies when
there's an overlap.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

bms_add_range_fix.patch

Re: [Sender Address Forgery]Re: [HACKERS] path toward fasterpartition pruning

От

Amit Langote

Дата:

08 ноября 2017 г., 11:23:18

Hi David.

Thanks for the review.

(..also looking at the comments you sent earlier today.)

On 2017/11/07 11:14, David Rowley wrote:
> On 7 November 2017 at 01:52, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13)
> 
> I have a little more review to share:
> 
> 1. Missing "in" in comment. Should be "mentioned in"
> 
>  * get_append_rel_partitions
>  * Return the list of partitions of rel that pass the clauses mentioned
>  * rel->baserestrictinfo
> 
> 2. Variable should be declared in inner scope with the following fragment:
> 
> void
> set_basic_child_rel_properties(PlannerInfo *root,
>    RelOptInfo *rel,
>    RelOptInfo *childrel,
>    AppendRelInfo *appinfo)
> {
> AttrNumber attno;
> 
> if (rel->part_scheme)
> {
> 
> which makes the code the same as where you moved it from.

It seems like you included the above changes in your attached C file,
which I will incorporate into my repository.

> 3. Normally lfirst() is assigned to a variable at the start of a
> foreach() loop. You have code which does not follow this.
> 
> foreach(lc, clauses)
> {
> Expr   *clause;
> int i;
> 
> if (IsA(lfirst(lc), RestrictInfo))
> {
> RestrictInfo *rinfo = lfirst(lc);
> 
> You could assign this to a Node * since the type is unknown to you at
> the start of the loop.

Will make the suggested changes to match_clauses_to_partkey().

> 4.
> /*
> * Useless if what we're thinking of as a constant is actually
> * a Var coming from this relation.
> */
> if (bms_is_member(rel->relid, constrelids))
> continue;
> 
> should this be moved to just above the op_strict() test? This one seems cheaper.

Agreed, will do.  Also makes sense to move the PartCollMatchesExprColl()
test together.

> 5. Typo "paritions": /* No clauses to prune paritions, so scan all
> partitions. */
> 
> But thinking about it more the comment should something more along the
> lines of /* No useful clauses for partition pruning. Scan all
> partitions. */

You fixed it. :)

> The key difference is that there might be clauses, just without Consts.
> 
> Actually, the more I look at get_append_rel_partitions() I think it
> would be better if you re-shaped that if/else if test so that it only
> performs the loop over the partindexes if it's been set.
> 
> I ended up with the attached version of the function after moving
> things around a little bit.

Thanks a lot for that.  Looks much better now.

> I'm still reviewing but thought I'd share this part so far.

As mentioned at the top, I'm looking at your latest comments and they all
seem to be good points to me, so will address those in the next version.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Rajkumar Raghuwanshi

Дата:

08 ноября 2017 г., 12:52:48

On Mon, Nov 6, 2017 at 3:31 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Attached updated set of patches, including the fix to make the new pruning
code handle Boolean partitioning.

Hi Amit,

I have tried pruning for different values of constraint exclusion GUC change, not sure exactly how it should behave, but I can see with the delete statement pruning is not happening when constraint_exclusion is off, but select is working as expected. Is this expected behaviour?

create table lp (c1 int, c2 text) partition by list(c1);
create table lp1 partition of lp for values in (1,2);
create table lp2 partition of lp for values in (3,4);
create table lp3 partition of lp for values in (5,6);
insert into lp values (1,'p1'),(2,'p1'),(3,'p2'),(4,'p2'),(5,'p3');

show constraint_exclusion ;
constraint_exclusion
----------------------
partition
(1 row)

explain select c1 from lp where c1 >= 1 and c1 < 2;
                        QUERY PLAN
----------------------------------------------------------
Append (cost=0.00..29.05 rows=6 width=4)
   -> Seq Scan on lp1 (cost=0.00..29.05 rows=6 width=4)
         Filter: ((c1 >= 1) AND (c1 < 2))
(3 rows)

explain delete from lp where c1 >= 1 and c1 < 2;
                        QUERY PLAN
----------------------------------------------------------
Delete on lp (cost=0.00..29.05 rows=6 width=6)
   Delete on lp1
   -> Seq Scan on lp1 (cost=0.00..29.05 rows=6 width=6)
         Filter: ((c1 >= 1) AND (c1 < 2))
(4 rows)

set constraint_exclusion = off;

explain select c1 from lp where c1 >= 1 and c1 < 2;
                        QUERY PLAN
----------------------------------------------------------
Append (cost=0.00..29.05 rows=6 width=4)
   -> Seq Scan on lp1 (cost=0.00..29.05 rows=6 width=4)
         Filter: ((c1 >= 1) AND (c1 < 2))
(3 rows)

explain delete from lp where c1 >= 1 and c1 < 2;
                        QUERY PLAN
----------------------------------------------------------
Delete on lp (cost=0.00..87.15 rows=18 width=6)
   Delete on lp1
   Delete on lp2
   Delete on lp3
   -> Seq Scan on lp1 (cost=0.00..29.05 rows=6 width=6)
         Filter: ((c1 >= 1) AND (c1 < 2))
   -> Seq Scan on lp2 (cost=0.00..29.05 rows=6 width=6)
         Filter: ((c1 >= 1) AND (c1 < 2))
   -> Seq Scan on lp3 (cost=0.00..29.05 rows=6 width=6)
         Filter: ((c1 >= 1) AND (c1 < 2))
(10 rows)

Thanks & Regards,

Rajkumar Raghuwanshi

QMG, EnterpriseDB Corporation

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

08 ноября 2017 г., 14:56:12

Hi Rajkumar,

Thanks for testing.

On 2017/11/08 15:52, Rajkumar Raghuwanshi wrote:
> On Mon, Nov 6, 2017 at 3:31 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
> wrote:
> 
>> Attached updated set of patches, including the fix to make the new pruning
>> code handle Boolean partitioning.
>>
> 
> Hi Amit,
> 
> I have tried pruning for different values of constraint exclusion GUC
> change, not sure exactly how it should behave, but I can see with the
> delete statement pruning is not happening when constraint_exclusion is off,
> but select is working as expected. Is this expected behaviour?

Hmm, the new pruning only works for selects, not DML.  The patch also
changes get_relation_constraints() to not include the internal partition
constraints, but mistakenly does so for all query types, not just select.
Will look into it.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

amul sul

Дата:

09 ноября 2017 г., 17:05:03

On Mon, Nov 6, 2017 at 3:31 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/11/06 14:32, David Rowley wrote:
>> On 6 November 2017 at 17:30, Amit Langote wrote:
>>> On 2017/11/03 13:32, David Rowley wrote:
>>>> On 31 October 2017 at 21:43, Amit Langote wrote:
[....]
>
> Attached updated set of patches, including the fix to make the new pruning
> code handle Boolean partitioning.
>

I am getting following warning on mac os x:

partition.c:1800:24: warning: comparison of constant -1 with
expression of type 'NullTestType'     (aka 'enum NullTestType') is always false
[-Wtautological-constant-out-of-range-compare]                               if (keynullness[i] == -1)
                ~~~~~~~~~~~~~~ ^  ~~
 
partition.c:1932:25: warning: comparison of constant -1 with
expression of type 'NullTestType'     (aka 'enum NullTestType') is always false
[-Wtautological-constant-out-of-range-compare]                                       if (keynullness[i] == -1)
                                ~~~~~~~~~~~~~~ ^  ~~
 
2 warnings generated.


Comment for 0004 patch:
270 +   /* -1 represents an invalid value of NullTestType. */271 +   memset(keynullness, -1, PARTITION_MAX_KEYS *
sizeof(NullTestType));

I think we should not use memset to set a value other than 0 or true/false.
This will work for -1 on the system where values are stored in the 2's
complement but I am afraid of other architecture.

Regards,
Amul


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

10 ноября 2017 г., 06:34:57

Hi Amul.

On 2017/11/09 20:05, amul sul wrote:
> On Mon, Nov 6, 2017 at 3:31 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/11/06 14:32, David Rowley wrote:
>>> On 6 November 2017 at 17:30, Amit Langote wrote:
>>>> On 2017/11/03 13:32, David Rowley wrote:
>>>>> On 31 October 2017 at 21:43, Amit Langote wrote:
> [....]
>>
>> Attached updated set of patches, including the fix to make the new pruning
>> code handle Boolean partitioning.
>>
> 
> I am getting following warning on mac os x:

Thanks for the review.

> partition.c:1800:24: warning: comparison of constant -1 with
> expression of type 'NullTestType'
>       (aka 'enum NullTestType') is always false
> [-Wtautological-constant-out-of-range-compare]
>                                 if (keynullness[i] == -1)
>                                     ~~~~~~~~~~~~~~ ^  ~~
> partition.c:1932:25: warning: comparison of constant -1 with
> expression of type 'NullTestType'
>       (aka 'enum NullTestType') is always false
> [-Wtautological-constant-out-of-range-compare]
>                                         if (keynullness[i] == -1)
>                                             ~~~~~~~~~~~~~~ ^  ~~
> 2 warnings generated.
> 
> 
> Comment for 0004 patch:
> 270 +   /* -1 represents an invalid value of NullTestType. */
>  271 +   memset(keynullness, -1, PARTITION_MAX_KEYS * sizeof(NullTestType));
> 
> I think we should not use memset to set a value other than 0 or true/false.
> This will work for -1 on the system where values are stored in the 2's
> complement but I am afraid of other architecture.

OK, I will remove all instances of comparing and setting variables of type
NullTestType to a value of -1.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Kyotaro HORIGUCHI

Дата:

10 ноября 2017 г., 09:30:00

Hello,

At Fri, 10 Nov 2017 09:34:57 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in
<5fcb1a9f-b4ad-119d-14c7-282c30c7f8d1@lab.ntt.co.jp>
> Hi Amul.
> 
> On 2017/11/09 20:05, amul sul wrote:
> > On Mon, Nov 6, 2017 at 3:31 PM, Amit Langote
> > <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >> On 2017/11/06 14:32, David Rowley wrote:
> >>> On 6 November 2017 at 17:30, Amit Langote wrote:
> >>>> On 2017/11/03 13:32, David Rowley wrote:
> >>>>> On 31 October 2017 at 21:43, Amit Langote wrote:
> > [....]
> >>
> >> Attached updated set of patches, including the fix to make the new pruning
> >> code handle Boolean partitioning.
> >>
> > 
> > I am getting following warning on mac os x:
> 
> Thanks for the review.
> 
> > partition.c:1800:24: warning: comparison of constant -1 with
> > expression of type 'NullTestType'
> >       (aka 'enum NullTestType') is always false
> > [-Wtautological-constant-out-of-range-compare]
> >                                 if (keynullness[i] == -1)
> >                                     ~~~~~~~~~~~~~~ ^  ~~
> > partition.c:1932:25: warning: comparison of constant -1 with
> > expression of type 'NullTestType'
> >       (aka 'enum NullTestType') is always false
> > [-Wtautological-constant-out-of-range-compare]
> >                                         if (keynullness[i] == -1)
> >                                             ~~~~~~~~~~~~~~ ^  ~~
> > 2 warnings generated.
> > 
> > 
> > Comment for 0004 patch:
> > 270 +   /* -1 represents an invalid value of NullTestType. */
> >  271 +   memset(keynullness, -1, PARTITION_MAX_KEYS * sizeof(NullTestType));
> > 
> > I think we should not use memset to set a value other than 0 or true/false.
> > This will work for -1 on the system where values are stored in the 2's
> > complement but I am afraid of other architecture.
> 
> OK, I will remove all instances of comparing and setting variables of type
> NullTestType to a value of -1.

In 0002, bms_add_range has a bit naive-looking loop

+    while (wordnum <= uwordnum)
+    {
+        bitmapword mask = (bitmapword) ~0;
+
+        /* If working on the lower word, zero out bits below 'lower'. */
+        if (wordnum == lwordnum)
+        {
+            int lbitnum = BITNUM(lower);
+            mask >>= lbitnum;
+            mask <<= lbitnum;
+        }
+
+        /* Likewise, if working on the upper word, zero bits above 'upper' */
+        if (wordnum == uwordnum)
+        {
+            int ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(upper) + 1);
+            mask <<= ushiftbits;
+            mask >>= ushiftbits;
+        }
+
+        a->words[wordnum++] |= mask;
+    }

Without some aggressive optimization, the loop takes most of the
time to check-and-jump for nothing especially with many
partitions and somewhat unintuitive.

The following uses a bit tricky bitmap operation but
is straightforward as a whole.

=====
/* fill the bits upper from BITNUM(lower) (0-based) of the first word */
a->workds[wordnum++] += ~(bitmapword)((1 << BITNUM(lower)) - 1);

/* fill up intermediate words */
while (wordnum < uwordnum)  a->words[wordnum++] = ~(bitmapword) 0;

/* fill up to BITNUM(upper) bit (0-based) of the last word */
a->workds[wordnum++] |=    (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) - 1));
=====


In 0003, 

+match_clauses_to_partkey(RelOptInfo *rel,
...
+      if (rinfo->pseudoconstant &&
+        (IsA(clause, Const) &&
+         ((((Const *) clause)->constisnull) ||
+          !DatumGetBool(((Const *) clause)->constvalue))))
+      {
+        *constfalse = true;
+        continue;

If we call this function in both conjunction and disjunction
context (the latter is only in recursive case). constfalse ==
true means no need of any clauses for the former case.

Since (I think) just a list of RestrictInfo is expected to be
treated as a conjunction (it's quite doubious, though..), we
might be better call this for each subnodes of a disjunction. Or,
like match_clauses_to_index, we might be better provide
match_clause_to_partkey(rel, rinfo, contains_const), which
returns NULL if constfalse. (I'm not self-confident on this..)


+          /*
+           * If no commutator exists, cannot flip the qual's args,
+           * so give up.
+           */
+          if (!OidIsValid(expr_op))
+            continue;

I suppose it's better to leftop and rightop as is rather than
flipping over so that var is placed left-side. Does that make
things so complex?

+     * It the operator happens to be '<>', which is never listed
If?


+        if (!op_in_opfamily(expr_op, partopfamily))
+        {
+          Oid    negator = get_negator(expr_op);
+
+          if (!OidIsValid(negator) ||
+            !op_in_opfamily(negator, partopfamily))
+            continue;

classify_partition_bounding_keys() checks the same thing by
checking whether the negator's strategy is
BTEquealStrategyNumber. (I'm not sure the operator is guaranteed
to be of btreee, though..) Aren't they needed to be in similar
way?

# In the function, "partkey strategy" and "operator strategy" are
# confusing..

+  AttrNumber  attno;

This declaration might be better in a narrower scope.


-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Kyotaro HORIGUCHI

Дата:

10 ноября 2017 г., 11:38:11

Hello, this is the second part of the review.

At Fri, 10 Nov 2017 12:30:00 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20171110.123000.151902771.horiguchi.kyotaro@lab.ntt.co.jp>
> In 0002, bms_add_range has a bit naive-looking loop
> In 0003, 

In 0004,

The name get_partitions_from_clauses_guts(), it seems to me that
we usually use _internal for recursive part of some function. (I
have the same comment as David about the comment for
get_partition_from_clause())

About the same function:

Couldn't we get out in the fast path when clauses == NIL?

+    /* No constraints on the keys, so, return *all* partitions. */
+    result = bms_add_range(result, 0, partdesc->nparts - 1);

This allows us to return immediately here. And just above this,

+    if (nkeys > 0 && !constfalse)
+        result = get_partitions_for_keys(relation, &keys);
+    else if (!constfalse)

Those two conditions are not orthogonal. Maybe something like
following seems more understantable.

> if (!constfalse)
> {
>   /* No constraints on the keys, so, return *all* partitions. */
>   if (nkeys == 0)
>     return bms_add_range(result, 0, partdesc->nparts - 1);
> 
>   result = get_partitions_for_keys(relation, &keys);
> }

I'm not sure what is meant to be (formally) done here, but it
seems to me that OrExpr is assumed to be only at the top level of
the caluses. So the following (just an example, but meaningful
expression in this shpape must exists.) expression is perhaps
wrongly processed here.

CREATE TABLE p (a int) PARITION BY (a);
CREATE TABLE c1 PARTITION OF p FOR VALUES FROM (0) TO (10);
CREATE TABLE c2 PARTITION OF p FOR VALUES FROM (10) TO (20);

SELECT * FROM p WHERE a = 15 AND (a = 15 OR a = 5);

get_partitions_for_keys() returns both c1 and c2 and still
or_clauses here holds (a = 15 OR a = 5) and the function tries to
*add* partitions for a = 15 and a = 5 separetely.


I'd like to pause here.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Kyotaro HORIGUCHI

Дата:

10 ноября 2017 г., 11:44:55

Ooops! The following comment is wrong. Please ignore it.

At Fri, 10 Nov 2017 14:38:11 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20171110.143811.97616847.horiguchi.kyotaro@lab.ntt.co.jp>
> Those two conditions are not orthogonal. Maybe something like
> following seems more understantable.
> 
> > if (!constfalse)
> > {
> >   /* No constraints on the keys, so, return *all* partitions. */
> >   if (nkeys == 0)
> >     return bms_add_range(result, 0, partdesc->nparts - 1);
> > 
> >   result = get_partitions_for_keys(relation, &keys);
> > }
> 
> I'm not sure what is meant to be (formally) done here, but it
> seems to me that OrExpr is assumed to be only at the top level of
> the caluses. So the following (just an example, but meaningful
> expression in this shpape must exists.) expression is perhaps
> wrongly processed here.
> 
> CREATE TABLE p (a int) PARITION BY (a);
> CREATE TABLE c1 PARTITION OF p FOR VALUES FROM (0) TO (10);
> CREATE TABLE c2 PARTITION OF p FOR VALUES FROM (10) TO (20);
> 
> SELECT * FROM p WHERE a = 15 AND (a = 15 OR a = 5);
> 
> get_partitions_for_keys() returns both c1 and c2 and still
> or_clauses here holds (a = 15 OR a = 5) and the function tries to
> *add* partitions for a = 15 and a = 5 separetely.

This is working fine. Sorry for the bogus comment.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Kyotaro HORIGUCHI

Дата:

10 ноября 2017 г., 13:07:01

Hello,

At Fri, 10 Nov 2017 14:44:55 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20171110.144455.117208639.horiguchi.kyotaro@lab.ntt.co.jp>

> > Those two conditions are not orthogonal. Maybe something like
> > following seems more understantable.
> > 
> > > if (!constfalse)
> > > {
> > >   /* No constraints on the keys, so, return *all* partitions. */
> > >   if (nkeys == 0)
> > >     return bms_add_range(result, 0, partdesc->nparts - 1);
> > > 
> > >   result = get_partitions_for_keys(relation, &keys);
> > > }

So, the condition (!constfalse && nkeys == 0) cannot return
there. I'm badly confused by the variable name.

I couldn't find another reasonable structure using the current
classify_p_b_keys(), but could you add a comment like the
following as an example?

+ /*
+  * Ths function processes other than OR expressions and returns
+  * the excluded OR expressions in or_clauses
+  */
>   nkeys = classify_partition_bounding_keys(relation, clauses,
>                        &keys, &constfalse,
>                        &or_clauses);
>   /*
>    * Only look up in the partition decriptor if the query provides
>    * constraints on the keys at all.
>    */
>   if (!constfalse)
>   {
>     if (nkey > 0)
>       result = get_partitions_for_keys(relation, &keys);
>     else
-+       /* No constraints on the keys, so, all partitions are passed. */
>       result = bms_add_range(result, 0, partdesc->nparts - 1);
>   }
> 
+   /*
+    * We have a partition set for clauses not returned in or_clauses
+    * here. Conjuct the result of each OR clauses.
+    */
>   foreach(lc, or_clauses)
>   {
>     BoolExpr *or = (BoolExpr *) lfirst(lc);
>     ListCell *lc1;
>     Bitmapset *or_partset = NULL;
> 
+     Assert(or_clause(or));

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

10 ноября 2017 г., 13:40:04

On 10 November 2017 at 16:30, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> In 0002, bms_add_range has a bit naive-looking loop
>
> +       while (wordnum <= uwordnum)
> +       {
> +               bitmapword mask = (bitmapword) ~0;
> +
> +               /* If working on the lower word, zero out bits below 'lower'. */
> +               if (wordnum == lwordnum)
> +               {
> +                       int lbitnum = BITNUM(lower);
> +                       mask >>= lbitnum;
> +                       mask <<= lbitnum;
> +               }
> +
> +               /* Likewise, if working on the upper word, zero bits above 'upper' */
> +               if (wordnum == uwordnum)
> +               {
> +                       int ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(upper) + 1);
> +                       mask <<= ushiftbits;
> +                       mask >>= ushiftbits;
> +               }
> +
> +               a->words[wordnum++] |= mask;
> +       }
>
> Without some aggressive optimization, the loop takes most of the
> time to check-and-jump for nothing especially with many
> partitions and somewhat unintuitive.
>
> The following uses a bit tricky bitmap operation but
> is straightforward as a whole.
>
> =====
> /* fill the bits upper from BITNUM(lower) (0-based) of the first word */
> a->workds[wordnum++] += ~(bitmapword)((1 << BITNUM(lower)) - 1);
>
> /* fill up intermediate words */
> while (wordnum < uwordnum)
>    a->words[wordnum++] = ~(bitmapword) 0;
>
> /* fill up to BITNUM(upper) bit (0-based) of the last word */
> a->workds[wordnum++] |=
>      (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) - 1));
> =====

No objections here for making bms_add_range() perform better, but this
is not going to work when lwordnum == uwordnum. You'd need to special
case that. I didn't think it was worth the trouble, but maybe it is...

I assume the += should be |=.

-- David Rowley                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

13 ноября 2017 г., 15:46:28

Horiguchi-san,

Thanks for taking a look.  Replying to all your emails here.

On 2017/11/10 12:30, Kyotaro HORIGUCHI wrote:
> In 0002, bms_add_range has a bit naive-looking loop
> 
> +    while (wordnum <= uwordnum)
> +    {
> +        bitmapword mask = (bitmapword) ~0;
> +
> +        /* If working on the lower word, zero out bits below 'lower'. */
> +        if (wordnum == lwordnum)
> +        {
> +            int lbitnum = BITNUM(lower);
> +            mask >>= lbitnum;
> +            mask <<= lbitnum;
> +        }
> +
> +        /* Likewise, if working on the upper word, zero bits above 'upper' */
> +        if (wordnum == uwordnum)
> +        {
> +            int ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(upper) + 1);
> +            mask <<= ushiftbits;
> +            mask >>= ushiftbits;
> +        }
> +
> +        a->words[wordnum++] |= mask;
> +    }
> 
> Without some aggressive optimization, the loop takes most of the
> time to check-and-jump for nothing especially with many
> partitions and somewhat unintuitive.
> 
> The following uses a bit tricky bitmap operation but
> is straightforward as a whole.
> 
> =====
> /* fill the bits upper from BITNUM(lower) (0-based) of the first word */
> a->workds[wordnum++] += ~(bitmapword)((1 << BITNUM(lower)) - 1);
> 
> /* fill up intermediate words */
> while (wordnum < uwordnum)
>    a->words[wordnum++] = ~(bitmapword) 0;
> 
> /* fill up to BITNUM(upper) bit (0-based) of the last word */
> a->workds[wordnum++] |=
>      (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) - 1));
> =====

Considering also the David's comment downthread, I will try to incorporate
this into bms_add_range().

> In 0003, 
> 
> +match_clauses_to_partkey(RelOptInfo *rel,
> ...
> +      if (rinfo->pseudoconstant &&
> +        (IsA(clause, Const) &&
> +         ((((Const *) clause)->constisnull) ||
> +          !DatumGetBool(((Const *) clause)->constvalue))))
> +      {
> +        *constfalse = true;
> +        continue;
> 
> If we call this function in both conjunction and disjunction
> context (the latter is only in recursive case). constfalse ==
> true means no need of any clauses for the former case.
> 
> Since (I think) just a list of RestrictInfo is expected to be
> treated as a conjunction (it's quite doubious, though..),

I think it makes sense to consider a list of RestrictInfo's, such as
baserestrictinfo, that is passed as input to match_clauses_to_partkey(),
to be mutually conjunctive for our purpose here.

> we
> might be better call this for each subnodes of a disjunction. Or,
> like match_clauses_to_index, we might be better provide
> match_clause_to_partkey(rel, rinfo, contains_const), which
> returns NULL if constfalse. (I'm not self-confident on this..)

After reading your comment, I realized that it was wrong that the
recursive call to match_clauses_to_partkey() passed the arguments of an OR
clause all at once.  That broke the assumption mentioned above that all of
the clauses in the list passed to match_clauses_to_partkey() are mutually
conjunctive.  Instead, we must make a single-member list for each of the
OR clause's arguments and pass the same.

Then if we got constfalse for all of the OR's arguments, then we return
the constfalse=true to the original caller.

> 
> +          /*
> +           * If no commutator exists, cannot flip the qual's args,
> +           * so give up.
> +           */
> +          if (!OidIsValid(expr_op))
> +            continue;
> 
> I suppose it's better to leftop and rightop as is rather than
> flipping over so that var is placed left-side. Does that make
> things so complex?

Reason to do it that way is that the code placed far away (code in
partition.c that extracts constant values to use for pruning from matched
clauses) can always assume that the clauses determined to be useful for
partition-pruning always come in the 'partkey op constant' form.

> +     * It the operator happens to be '<>', which is never listed
> If?

Right, will fix.

> +        if (!op_in_opfamily(expr_op, partopfamily))
> +        {
> +          Oid    negator = get_negator(expr_op);
> +
> +          if (!OidIsValid(negator) ||
> +            !op_in_opfamily(negator, partopfamily))
> +            continue;
> 
> classify_partition_bounding_keys() checks the same thing by
> checking whether the negator's strategy is
> BTEquealStrategyNumber. (I'm not sure the operator is guaranteed
> to be of btreee, though..) Aren't they needed to be in similar
> way?

You're right.  The <>'s negator may not always be a btree operator.  So,
we should add a check in match_clauses_to_partkey() that list or range
partitioning is in use, because only those require a btree operator
family.  We now have hash partitioning, so need to be careful not to make
the assumption that all partitioning operators are from btree operator
families.

If match_clauses_to_partkey() accepts such a clause with a <> operator,
then classify_partition_bounding_keys() can rely that it will get the
desired btree operators to implement pruning for the same.

> # In the function, "partkey strategy" and "operator strategy" are
> # confusing..

I agree it would be better to make that clear using comments.

Partitioning strategy and operator strategies are intimately related.
List and range partitioning related optimizations will only work if the
clause operators are of valid btree strategies, hash partitioning
optimizations will only work if the operator in the matched clauses is a
valid hash equality operator.


> +  AttrNumber  attno;
> 
> This declaration might be better in a narrower scope.

Agreed, will move.

On 2017/11/10 14:38, Kyotaro HORIGUCHI wrote:
> Hello, this is the second part of the review.
>
> At Fri, 10 Nov 2017 12:30:00 +0900 , Kyotaro HORIGUCHI wrote:
>> In 0003,
>
> In 0004,
>
> The name get_partitions_from_clauses_guts(), it seems to me that
> we usually use _internal for recursive part of some function. (I
> have the same comment as David about the comment for
> get_partition_from_clause())

OK, will replace _guts by _internal.  Looking at the David's comments too.

> About the same function:
>
> Couldn't we get out in the fast path when clauses == NIL?

Actually get_partitions_for_clauses() won't be called at all if that were
true.  I think I should add an Assert that clauses is not NIL.

>
> +    /* No constraints on the keys, so, return *all* partitions. */
> +    result = bms_add_range(result, 0, partdesc->nparts - 1);
>
> This allows us to return immediately here. And just above this,
>
> +    if (nkeys > 0 && !constfalse)
> +        result = get_partitions_for_keys(relation, &keys);
> +    else if (!constfalse)
>
> Those two conditions are not orthogonal. Maybe something like
> following seems more understantable.
>
>> if (!constfalse)
>> {
>>   /* No constraints on the keys, so, return *all* partitions. */
>>   if (nkeys == 0)
>>     return bms_add_range(result, 0, partdesc->nparts - 1);
>>
>>   result = get_partitions_for_keys(relation, &keys);
>> }

Agreed that your suggested rewrite of that portion of the code is easy to
read, but we cannot return yet in the nkeys == 0 case, as you also said
you found out.  I quote your other reply further below.

> I'm not sure what is meant to be (formally) done here, but it
> seems to me that OrExpr is assumed to be only at the top level of
> the caluses. So the following (just an example, but meaningful
> expression in this shpape must exists.) expression is perhaps
> wrongly processed here.
>
> CREATE TABLE p (a int) PARITION BY (a);
> CREATE TABLE c1 PARTITION OF p FOR VALUES FROM (0) TO (10);
> CREATE TABLE c2 PARTITION OF p FOR VALUES FROM (10) TO (20);
>
> SELECT * FROM p WHERE a = 15 AND (a = 15 OR a = 5);
>
> get_partitions_for_keys() returns both c1 and c2 and still
> or_clauses here holds (a = 15 OR a = 5) and the function tries to
> *add* partitions for a = 15 and a = 5 separetely.
>
> I'd like to pause here.

[ ... ]

On 2017/11/10 14:44, Kyotaro HORIGUCHI wrote:
> At Fri, 10 Nov 2017 14:38:11 +0900, Kyotaro HORIGUCHI wrote:
>
> This is working fine. Sorry for the bogus comment.

I'd almost started looking around if something might be wrong after all. :)

On 2017/11/10 16:07, Kyotaro HORIGUCHI wrote:
> At Fri, 10 Nov 2017 14:44:55 +0900, Kyotaro HORIGUCHI wrote:
>
>>> Those two conditions are not orthogonal. Maybe something like
>>> following seems more understantable.
>>>
>>>> if (!constfalse)
>>>> {
>>>>   /* No constraints on the keys, so, return *all* partitions. */
>>>>   if (nkeys == 0)
>>>>     return bms_add_range(result, 0, partdesc->nparts - 1);
>>>>
>>>>   result = get_partitions_for_keys(relation, &keys);
>>>> }
>
> So, the condition (!constfalse && nkeys == 0) cannot return
> there. I'm badly confused by the variable name.

Do you mean by 'constfalse'?

> I couldn't find another reasonable structure using the current
> classify_p_b_keys(), but could you add a comment like the
> following as an example?

OK, will add comments explaining what's going on.


Will post the updated patches after also taking care of David's and Amul's
review comments upthread.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

14 ноября 2017 г., 10:00:12

On 13 November 2017 at 22:46, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/11/10 12:30, Kyotaro HORIGUCHI wrote:
>> In 0002, bms_add_range has a bit naive-looking loop
>>
>> +     while (wordnum <= uwordnum)
>> +     {
>> +             bitmapword mask = (bitmapword) ~0;
>> +
>> +             /* If working on the lower word, zero out bits below 'lower'. */
>> +             if (wordnum == lwordnum)
>> +             {
>> +                     int lbitnum = BITNUM(lower);
>> +                     mask >>= lbitnum;
>> +                     mask <<= lbitnum;
>> +             }
>> +
>> +             /* Likewise, if working on the upper word, zero bits above 'upper' */
>> +             if (wordnum == uwordnum)
>> +             {
>> +                     int ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(upper) + 1);
>> +                     mask <<= ushiftbits;
>> +                     mask >>= ushiftbits;
>> +             }
>> +
>> +             a->words[wordnum++] |= mask;
>> +     }
>>
>> Without some aggressive optimization, the loop takes most of the
>> time to check-and-jump for nothing especially with many
>> partitions and somewhat unintuitive.
>>
>> The following uses a bit tricky bitmap operation but
>> is straightforward as a whole.
>>
>> =====
>> /* fill the bits upper from BITNUM(lower) (0-based) of the first word */
>> a->workds[wordnum++] += ~(bitmapword)((1 << BITNUM(lower)) - 1);
>>
>> /* fill up intermediate words */
>> while (wordnum < uwordnum)
>>    a->words[wordnum++] = ~(bitmapword) 0;
>>
>> /* fill up to BITNUM(upper) bit (0-based) of the last word */
>> a->workds[wordnum++] |=
>>      (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) - 1));
>> =====
>
> Considering also the David's comment downthread, I will try to incorporate
> this into bms_add_range().

I've attached an implementation of the patch using this method.

I've also attached bitset.c which runs each through their paces. I'd
have expected Kyotaro's method to be faster, but gcc 7.2 with -O2
generates very slightly slower code. I didn't really check why. clang
seems to do a better job with it.

$ gcc -O2 bitset.c -o bitset && ./bitset
bms_add_range in 0.694254 (6.94254 ns per loop)
bms_add_range2 in 0.726643 (7.26643 ns per loop)
11111111111111111111111111111110
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
-------------
11111111111111111111111111111110
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
$ gcc --version
gcc (Ubuntu 7.2.0-8ubuntu3) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ clang -O2 bitset.c -o bitset && ./bitset
bms_add_range in 0.866554 (8.66554 ns per loop)
bms_add_range2 in 0.467138 (4.67138 ns per loop)
11111111111111111111111111111110
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
-------------
11111111111111111111111111111110
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
$ clang --version
clang version 4.0.1-6 (tags/RELEASE_401/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

Probably just go with Kyotaro's idea (v2). I don't think this is worth
debating, I just wanted to show it's not that clear-cut.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Hi David.

On 2017/11/08 13:44, David Rowley wrote:
> On 7 November 2017 at 01:52, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13)
> 
> Hi Amit,
> 
> I had another look over this today. Apologies if any of the review seems petty.

Thanks a lot for the review.

> Here goes:
> 
> 1. If test seems to be testing for a child that's a partitioned table,
> but is testing for a non-NULL part_scheme.
> 
> /*
> * If childrel is itself partitioned, add it and its partitioned
> * children to the list being propagated up to the root rel.
> */
> if (childrel->part_scheme && rel->part_scheme)
> 
> Should this code use IS_PARTITIONED_REL() instead? Seems a bit strange
> to test for a NULL part_scheme

I guess that makes sense, done.

> 
> 2. There's a couple of mistakes in my bms_add_range() code. I've
> attached bms_add_range_fix.patch. Can you apply this to your tree?

Thanks.  I have used your bms_add_range_v2.patch that you sent earlier
today and listed both your and Horiguchi-san's names as author.

> 3. This assert seems to be Asserting the same thing twice:
> 
> Assert(rel->live_partitioned_rels != NIL &&
>    list_length(rel->live_partitioned_rels) > 0);
> 
> A List with length == 0 is always NIL.

You're right.  I changed it to:

  Assert(list_length(rel->live_partitioned_rels) >= 1);

> 4. get_partitions_from_clauses(), can you comment why you perform the
> list_concat() there.
> 
> I believe this is there so that the partition bound from the parent is
> passed down to the child so that we can properly eliminate all child
> partitions when the 2nd level of partitioning is using the same
> partition key as the 1st level. I think this deserves a paragraph of
> comment to explain this.

Yes, that's the intent.  I implemented it as a solution to fix a problem
that was reported upthread, whereby the default partition pruning didn't
work as desired.  I tried to explain it in the following email:

https://www.postgresql.org/message-id/8499324c-8a33-4be7-9d23-7e6a95e60ddf%40lab.ntt.co.jp

But, since I devised it as a solution to get the desired behavior for the
default partition, I modified the code to include partition constraint
clauses to do it only when the table has a default partition in the first
place.  Doing it always is an overkill.  Please see the comment added
nearby if it now helps make sense of what's going on.

> 5. Please add a comment to explain what's going on here in
> classify_partition_bounding_keys()
> 
> if (partattno == 0)
> {
> partexpr = lfirst(partexprs_item);
> partexprs_item = lnext(partexprs_item);
> }
> 
> Looks like, similar to index expressions, that partition expressions
> are attno 0 to mark to consume the next expression from the list.

Right.

> Does this need validation that there are enough partexprs_item items
> like what is done in get_range_key_properties()? Or is this validated
> somewhere else?

Yeah, I added the check as follows:

+            /* Set partexpr if needed. */
             if (partattno == 0)
             {
+                if (partexprs_item == NULL)
+                    elog(ERROR, "wrong number of partition key expressions");
                 partexpr = lfirst(partexprs_item);
                 partexprs_item = lnext(partexprs_item);

> 
> 6. Comment claims the if test will test something which it does not
> seem to test for:
> 
> /*
> * Redundant key elimination using btree-semantics based tricks.
> *
> * Only list and range partitioning use btree operator semantics, so
> * skip otherwise.   Also, if there are expressions whose value is yet
> * unknown, skip this step, because we need to compare actual values
> * below.
> */
> memset(keyclauses, 0, PARTITION_MAX_KEYS * sizeof(List *));
> if (partkey->strategy == PARTITION_STRATEGY_LIST ||
> partkey->strategy == PARTITION_STRATEGY_RANGE)
> 
> I was expecting this to be skipped when the clauses contained a
> non-const, but it does not seem to.

Fixed the comment.  Actually we might end up with non-consts here if
executor invokes it, so the downstream code is in position to handle them,
skipping any optimizations that depend on constant values being available.
 There are actually even cases when the planner wouldn't mind calling here
even if the matched clauses contained non-const operands as long as there
are at least some constants available.

> 7. Should be "compare them"
> 
> /*
> * If the leftarg and rightarg clauses' constants are both of the type
> * expected by "op" clause's operator, then compare then using the
> * latter's comparison function.
> */
> 
> But if I look at the code "compare then using the latter's comparison
> function." is not true, it seems to use op's comparison function not
> rightarg's. With most of the calls op and rightarg are the same, but
> not all of them. The function shouldn't make that assumption even if
> the args op was always the same as rightarg.

Rearranged the code in partition_cmp_args() a bit and added more
clarifying comments.

> 8. remove_redundant_clauses() needs an overview comment of what the
> function does.

Done.

> 9. The comment should explain what we would do in the case of key < 3
> AND key <= 2 using some examples.
> 
> /* try to keep only one of <, <= */

Done.

> 10. Wondering why this loop runs backward?
> 
> for (s = BTMaxStrategyNumber; --s >= 0;)
> 
> Why not just:
> 
> for (s = 0; s < BTMaxStrategyNumber; s++)
> 
> I can't see a special reason for it to run backward. It seems unusual,
> but if there's a good reason that I've failed to realise then it's
> maybe worth a comment.

Hmm, no special reason.  So, done the other way.  I actually brought this
redundant key logic elimination logic over from nbtutils.c:
_bt_preprocess_keys() and the loop runs that way over there.

> 11. Pleae comment on why *constfalse = true is set here:
> 
> if (!chk || s == (BTEqualStrategyNumber - 1))
> continue;
> 
> if (partition_cmp_args(partopfamily, partopcintype, chk, eq, chk,
>    &test_result))
> {
> if (!test_result)
> {
> *constfalse = true;
> return;
> }
> /* discard the redundant key. */
> xform[s] = NULL;
> }
> 
> Looks like we'd hit this in a case such as: WHERE key = 1 AND key > 1.

Right.  Added a comment with example.

> Also please add a comment when discarding the redundant key maybe
> explain that equality is more useful than the other strategies when
> there's an overlap.

Done, too.

Please find attached updated patch set.   There are significant changes in
this version as described below, including the support for hash
partitioned tables.

Earlier today, I reported [1] what looks to me like a bug in how default
partition's constraint gets generated and how that sometimes makes
constraint exclusion mistakenly prune a default partition.  I have
included the patches I posted there in this series too.  They are ahead in
the list.

So attached patches are now as follows:

0001-Add-default-partition-case-in-inheritance-testing.patch
0002-Tweak-default-range-partition-s-constraint-a-little.patch

Patches at [1].

0003-Add-new-tests-for-partition-pruning.patch

Tests.  Mostly unchanged from the last version.

0004-Add-a-bms_add_range.patch

Uses the bms_add_range_v2 patch.

0005-Planner-side-changes-for-partition-pruning.patch

Fixed some issues with how OR clauses were matched and the code for
checking if the operator is partitioning compatible is no longer in the
planner code, instead it's now only in partition.c.  Other cosmetic
improvements including those that resulted from the review comments.

0006-Implement-get_partitions_from_clauses.patch

Several bug fixes and many cosmetic improvements including improved
commentary.  Per Amul's comment upthread, the patch no longer depends on
using value -1 to denote an invalid value of NulltestType enum.  In fact,
it doesn't use NulltestType type variables at all.

0007-Some-interface-changes-for-partition_bound_-cmp-bsea.patch

No changes.

0008-Implement-get_partitions_for_keys.patch

Support for hash partitioning and tests for the same.  Also, since
update/delete on partitioned tables still depend on constraint exclusion
for pruning, fix things such that get_relation_constraints includes
partition constraints in its result only for non-select queries (for
selects we have the new pruning code).  Other bug fixes.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/ba7aaeb1-4399-220e-70b4-62eade1522d0%40lab.ntt.co.jp

Вложения

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

17 ноября 2017 г., 18:44:56

On 17 November 2017 at 23:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Please find attached updated patch set. There are significant changes in
this version as described below, including the support for hash
partitioned tables.

Hi Amit,

Thanks for making those changes and adding the HASH partition support.

There's a good chance that I'm not going to get time to look at this maybe until the last day of the month. I hope someone else can look over it in the meantime.

David Rowley http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services

Re: [Sender Address Forgery]Re: [HACKERS] path toward fasterpartition pruning

От

Amit Langote

Дата:

21 ноября 2017 г., 07:15:10

Hi David.

On 2017/11/17 21:44, David Rowley wrote:
> On 17 November 2017 at 23:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
> wrote:
> 
>> Please find attached updated patch set.   There are significant changes in
>> this version as described below, including the support for hash
>> partitioned tables.
>>
> 
> Hi Amit,
> 
> Thanks for making those changes and adding the HASH partition support.
> 
> There's a good chance that I'm not going to get time to look at this maybe
> until the last day of the month. I hope someone else can look over it in
> the meantime.

No problem, thanks for the review so far and look forward to the next time
you'll be able to comment.

Regards,
Amit

Re: [HACKERS] path toward faster partition pruning

От

Kyotaro HORIGUCHI

Дата:

21 ноября 2017 г., 15:42:40

Thank you and sorry for the confused comments.

At Mon, 13 Nov 2017 18:46:28 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in
<8460a6c3-68c5-b78a-7d18-d253180f2188@lab.ntt.co.jp>
> Horiguchi-san,
> 
> Thanks for taking a look.  Replying to all your emails here.

> > In 0003, 
> > 
> > +match_clauses_to_partkey(RelOptInfo *rel,
> > ...
> > +      if (rinfo->pseudoconstant &&
> > +        (IsA(clause, Const) &&
> > +         ((((Const *) clause)->constisnull) ||
> > +          !DatumGetBool(((Const *) clause)->constvalue))))
> > +      {
> > +        *constfalse = true;
> > +        continue;
> > 
> > If we call this function in both conjunction and disjunction
> > context (the latter is only in recursive case). constfalse ==
> > true means no need of any clauses for the former case.
> > 
> > Since (I think) just a list of RestrictInfo is expected to be
> > treated as a conjunction (it's quite doubious, though..),
> 
> I think it makes sense to consider a list of RestrictInfo's, such as
> baserestrictinfo, that is passed as input to match_clauses_to_partkey(),
> to be mutually conjunctive for our purpose here.

You're right and I know it. I'm ok to leave it since I recalled
that clause_selectivity always has a similar code.

> On 2017/11/10 14:44, Kyotaro HORIGUCHI wrote:
> > At Fri, 10 Nov 2017 14:38:11 +0900, Kyotaro HORIGUCHI wrote:
> >
> > This is working fine. Sorry for the bogus comment.
> 
> I'd almost started looking around if something might be wrong after all. :)

Very sorry for the wrong comment:(

> On 2017/11/10 16:07, Kyotaro HORIGUCHI wrote:
> > At Fri, 10 Nov 2017 14:44:55 +0900, Kyotaro HORIGUCHI wrote:
> >
> >>> Those two conditions are not orthogonal. Maybe something like
> >>> following seems more understantable.
> >>>
> >>>> if (!constfalse)
> >>>> {
> >>>>   /* No constraints on the keys, so, return *all* partitions. */
> >>>>   if (nkeys == 0)
> >>>>     return bms_add_range(result, 0, partdesc->nparts - 1);
> >>>>
> >>>>   result = get_partitions_for_keys(relation, &keys);
> >>>> }
> >
> > So, the condition (!constfalse && nkeys == 0) cannot return
> > there. I'm badly confused by the variable name.
> 
> Do you mean by 'constfalse'?

Perhaps. The name "constfalse" is suggesting (for me) that the
cluses evaluate to false constantly. But acutally it means just
the not-in-the-return clauses are results in false. Anyway I'll
take a look on v12 and will comment at the time.

> > I couldn't find another reasonable structure using the current
> > classify_p_b_keys(), but could you add a comment like the
> > following as an example?
> 
> OK, will add comments explaining what's going on.
> 
> 
> Will post the updated patches after also taking care of David's and Amul's
> review comments upthread.
> 
> Thanks,
> Amit

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] path toward faster partition pruning

От

Rajkumar Raghuwanshi

Дата:

21 ноября 2017 г., 16:06:50

On Fri, Nov 17, 2017 at 3:31 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Support for hash partitioning and tests for the same. Also, since
update/delete on partitioned tables still depend on constraint exclusion
for pruning, fix things such that get_relation_constraints includes
partition constraints in its result only for non-select queries (for
selects we have the new pruning code). Other bug fixes.

Hi Amit,

I have applied attached patch set on commit 11e264517dff7a911d9e6494de86049cab42cde3 and try to test for hash partition. I got a server crash with below test.

CREATE TABLE hp_tbl (a int, b int, c int) PARTITION BY HASH (a);
CREATE TABLE hp_tbl_p1 PARTITION OF hp_tbl FOR VALUES WITH (modulus 4, remainder 0) PARTITION BY HASH (b);
CREATE TABLE hp_tbl_p1_p1 PARTITION OF hp_tbl_p1 FOR VALUES WITH (modulus 4, remainder 0) PARTITION BY HASH (c);
CREATE TABLE hp_tbl_p1_p1_p1 PARTITION OF hp_tbl_p1_p1 FOR VALUES WITH (modulus 4, remainder 0);
CREATE TABLE hp_tbl_p2 PARTITION OF hp_tbl FOR VALUES WITH (modulus 4, remainder 1) PARTITION BY HASH (b);
CREATE TABLE hp_tbl_p2_p1 PARTITION OF hp_tbl_p2 FOR VALUES WITH (modulus 4, remainder 1);
CREATE TABLE hp_tbl_p2_p2 PARTITION OF hp_tbl_p2 FOR VALUES WITH (modulus 4, remainder 2);
insert into hp_tbl select i,i,i from generate_series(0,10)i where i not in(2,4,6,7,10);

explain select * from hp_tbl where a = 2;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

Thanks & Regards,

Rajkumar Raghuwanshi

QMG, EnterpriseDB Corporation

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

22 ноября 2017 г., 14:59:48

Thanks Rajkumar for the test.

On 2017/11/21 19:06, Rajkumar Raghuwanshi wrote:
> CREATE TABLE hp_tbl (a int, b int, c int) PARTITION BY HASH (a);
> CREATE TABLE hp_tbl_p1 PARTITION OF hp_tbl FOR VALUES WITH (modulus 4,
> remainder 0) PARTITION BY HASH (b);
> CREATE TABLE hp_tbl_p1_p1 PARTITION OF hp_tbl_p1 FOR VALUES WITH (modulus
> 4, remainder 0) PARTITION BY HASH (c);
> CREATE TABLE hp_tbl_p1_p1_p1 PARTITION OF hp_tbl_p1_p1 FOR VALUES WITH
> (modulus 4, remainder 0);
> CREATE TABLE hp_tbl_p2 PARTITION OF hp_tbl FOR VALUES WITH (modulus 4,
> remainder 1) PARTITION BY HASH (b);
> CREATE TABLE hp_tbl_p2_p1 PARTITION OF hp_tbl_p2 FOR VALUES WITH (modulus
> 4, remainder 1);
> CREATE TABLE hp_tbl_p2_p2 PARTITION OF hp_tbl_p2 FOR VALUES WITH (modulus
> 4, remainder 2);
> insert into  hp_tbl select i,i,i from generate_series(0,10)i where i not
> in(2,4,6,7,10);
> 
> explain select * from hp_tbl where a = 2;
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> !>

It seems I wrote an Assert in the code to support hash partitioning that
wasn't based on a valid assumption.  I was wrongly assuming that all hash
partitions for a given modulus (largest modulus) must exist at any given
time, but that isn't the case.

Fixed in the attached.  No other changes beside that.

Thanks,
Amit

Horiguchi-san,

Thanks a lot for the review and sorry it took me a while to reply.

On 2017/11/28 20:39, Kyotaro HORIGUCHI wrote:
> At Wed, 22 Nov 2017 17:59:48 +0900, Amit Langote wrote:
> 0001 and 0002 are under discussion with Robert in another thread.

And now committed [1].

> I don't have a comment on 0003, 0004.
> 
> 0005:
> 
>  get_partitions_from_clauses is written as _using_ in it's
>  comment. (also get_partitions_from_clauses_recurse is _guts in
>  its comment.)

Fixed both.

>  get_append_rel_partitions just returns NIL if constfalse. I
>  suppose we'd better reducing indentation level
>  here. get_partitions_from_clauses_recurse in 0006 does the same
>  thing.

Less indentation sounds good to me to, so fixed.

>  In the same function, there's a else clause separated from then
>  clause by a multiline comment. It seems better that the else
>  clause has braces and the comment is in the braces like the
>  following.
> 
>  >  else
>  >  {
>  >    /*
>  >     * Else there are no clauses....
>  >     */
>  >    partindexes = bms_add_range(NULL, 0, partdesc->nparts - 1);
>  >  }

Done that way.

> 0006:
> 
>  In get_partitions_from_clauses_recurse, the following comment
>  seems confusing.
> 
> +    /*
> +     * The analysis of the matched clauses done by
> +     * classify_partition_bounding_keys may have found mutually contradictory
> +     * clauses.
> +     */
> 
>  constfalse = true also when the cluase was just one false pseudo
>  constant restrictinfo.

Updated the comment like this:

  /*
   * classify_partition_bounding_keys() may have found clauses marked
   * pseudo-constant that are false that the planner didn't or it may have
   * itself found contradictions among clauses.
   */

> +    if (!constfalse)
> +    {
> +        /*
> +         * If all clauses in the list were OR clauses,
> +         * classify_partition_bounding_keys() wouldn't have formed keys
> +         * yet.  They will be handled below by recursively calling this
> +         * function for each of OR clauses' arguments and combining the
> +         * resulting partition sets appropriately.
> +         */
> +        if (nkeys > 0)
> 
>  classify_p_b_keys() to return zero also when is not only all-OR
>  clauses(all AND clauses with volatile function also returns
>  zero).

Hmm, if all AND clauses contained volatile functions, then planner
wouldn't have called here at all.  Also, any clauses it passes to
get_partitions_from_clauses() are known to be OK to use for pruning.

> +            /* Set partexpr if needed. */
> +            if (partattno == 0)
> 
>  Could you add a description about the meaning of 0 to the
>  comment of PartitionKeyData something like this?

Sure, done.

> 
>   | AttrNumber *partattrs;    /* attribute numbers of columns in the
>   |                            * partition key. 0 means partexpression */
> 
> + #define EXPR_MATCHES_PARTKEY(expr, partattno, partexpr) \
> +         ((IsA((expr), Var) &&\
> +          ((Var *) (expr))->varattno == (partattno)) ||\
> +          equal((expr), (partexpr)))
> ...
> +        if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr))
> 
>  partattno = 0 has a different meaning than ordinary attnos.     
>  I belive that the leftop cannot be a whole row var, but I
>  suppose we should make it clear. Likewise, even it doesn't
>  actually happen but it formally has a chance to make a false
>  match since partexpr is not cleared when partattno > 0.
>  EXPR_MATCHES_PARTKEY might be better be something like follows.
> 
> | #define EXPR_MATCHES_PARTKEY(expr, partattno, partexpr) \
> |     ((partattno) != 0 ? \
> |      (IsA((expr), Var) && ((Var *) (expr))->varattno == (partattno)) :\
> |      equal((expr), (partexpr)))

That's better, fixed.

> +        if (!op_in_opfamily(opclause->opno, partopfamily))
> +        {
> ...
> +          negator = get_negator(opclause->opno);
> +          if (OidIsValid(negator) &&
> +            op_in_opfamily(negator, partopfamily))
> +          {
> +            get_op_opfamily_properties(negator, partopfamily,
> +                           false,
> +                           &strategy,
> +                           &lefttype, &righttype);
> +            if (strategy == BTEqualStrategyNumber)
> 
>  This seems to me to be a bit too much relying on the specific
>  relationship of the access methods' property. Isn't it
>  reasonable that add checking that partkey->strategy != 'h'
>  before getting negator?
> 
> +          commuted->opno = get_commutator(opclause->opno);
> 
>  Im afraid that get_commutator can return InvalidOid for
>  user-defined types or by user-defined operator class or perhaps
>  other reasons uncertain to me. match_clauses_to_partkey is
>  checking that.
> 
> +      else if (IsA(clause, ScalarArrayOpExpr))
> 
>  I'm not sure what to do with a multidimentional ArrayExpr but
>  ->multidims is checked some places.

In another thread that I recently started [2], Tom seemed to point out
that ScalarArrayOpExpr cannot really be used for comparing an array on LHS
with, say, an array of arrays on RHS.  The LHS expression must yield a
scalar value which is compared with individual scalar values in the array
on RHS.  The syntax allows arbitrarily multi-dimensional array to be
specified on RHS, but ultimately only looks at the scalar therein.  For
example:

select 1 = any (array[array[array[1]]]);
 ?column?
|---------
 t
(1 row)

select 1 = any ('{{{1}}}');
 ?column?
|---------
 t
(1 row)

select 1 = any ('{{{2}}}');
 ?column?
|---------
 f
(1 row)

But for the code in question on which you seemed to have commented, it
only works if the array is presented as a Const that can be deconstructed
to a list of scalar values using deconstruct_array().  If the array was
presented as an ArrayExpr, and its multidims is true, the elements list
cannot simply be assumed to contain Const nodes holding scalar values.
So, we should prohibit that case from proceeding.  Although, I haven't
been able to frame a test query that results in such a case.

> +        ParseState *pstate = make_parsestate(NULL);
> 
>  make_parsestate mandates for the caller to free it by
>  free_parsestate(). It doesn't seem to leak anything in the
>  context and I saw the same thing at other places but it would be
>  better to follow it if possible, or add some apology as a
>  comment.. (or update the comment of make_parsestate?)

Added a free_parsestate() call.

> 
> +  * If the leftarg_const and rightarg_consr are both of the type expected
> 
>  rightarg_consr -> const

Fixed.

> 
> +        if (partition_cmp_args(partkey, partattoff,
> +                               le, lt, le,
> +                               &test_result))
> 
> +        if (partition_cmp_args(partkey, partattoff, ge, gt, ge,
> +                               &test_result))
> 
>  Please unify the style.

Fixed.

> +       * Boolean conditions have a special shape, which would've been
> +       * accepted if the partitioning opfamily accepts Boolean
> +       * conditions.
> 
>  I noticed that bare true and false are not accepted by the
>  values list of create table syntax. This is not a comment on
>  this patch but is that intentional?

Hmm, I guess the original partitioning patch forgot to accept TRUE_P,
FALSE_P as valid partition bound datums along with Sconst, NumericOnly,
and NULL_P.  Sent a patch for that in a separate thread [3].

Attached updated patches.  As Robert commented [4], I tried to re-arrange
the patches after breaking down the planner patch ("0005 looks like it
might need to be split into smaller patches").  Brief description of each
follows:

[PATCH 1/5] Some interface changes for partition_bound_{cmp/bsearch}

As the name says, it's a preparatory patch to enable the next patch to use
the partition bound binary search function using partial keys.  Until now,
callers needed to specify the whole key (specify values for all partition
columns), but a query may not have clauses on all partition columns, so we
should be able to compare such incomplete keys against partition bounds.

[PATCH 2/5] Introduce a get_partitions_from_clauses()

This used to be a last patch in the previous versions.  But I moved it
ahead in the list as basic functionality that needs to be in place before
starting to modify the planner to use the same for faster pruning.

To summarize, just like the existing get_partition_for_tuple() that
receives a tuple from ExecFindPartition() and returns the index of the
partition that should contain that tuple, get_partitions_from_clauses()
will receive a set of clauses that are all known to match some partition
key and derive from it and return the set of indexes of partitions that
satisfy all those clauses.

[PATCH 3/5] Move some code of set_append_rel_size to separate
 function

First in the series that modifies the planner.  Just a preparatory patch
that moves some code.

[PATCH 4/5] More refactoring around partitioned table AppendPath
 creation

Another refactoring patch that changes how we manipulate the partitions
when generating AppendPath for partitioned tables.

Actually, there is one behavior change here - partitioned_rels in Append
today carry the RT indexes of even the partitioned child tables that have
been pruned.  Patch modifies things so that that doesn't happen anymore.

[PATCH 5/5] Teach planner to use get_partitions_from_clauses()

With this patch, we finally hit the new faster pruning functionality.

Partitions will be pruned even before looking at the individual partition
RelOptInfos.  In fact, set_append_rel_size() now only ever looks at
non-pruned partitions.

Since, we can call get_partitions_from_clauses() from only the SELECT
planning code path, UPDATE/DELETE cases still rely on constraint
exclusion.  So, we retrieve the partition constraint only in those cases.

Thanks,
Amit

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7b88d63a9122646ead60c1afffc248a31d4e457d

[2] https://www.postgresql.org/message-id/7677.1512743642%40sss.pgh.pa.us

[3]
https://www.postgresql.org/message-id/e05c5162-1103-7e37-d1ab-6de3e0afaf70%40lab.ntt.co.jp

[4]
https://www.postgresql.org/message-id/CA%2BTgmoYYrPA21e0y5w2NW2-sbANFR4n2nbrSWEWjzvaa_GNi0g%40mail.gmail.com

Hello David.

Thanks for the reviews.  Replying to all your emails here.

On 2017/12/19 13:36, David Rowley wrote:
> On 12 December 2017 at 22:13, Amit Langote wrote:
>> Attached updated patches.
> 
> Hi Amit,
> 
> I'm sorry to say this is another micro review per code I'm stumbling
> over when looking at the run-time partition pruning stuff.
> 
> 1. In get_partitions_from_clauses_recurse(), since you're assigning
> the result to the first input, the following should use
> bms_add_members and not bms_union. The logical end result is the same,
> but using bms_union means a wasted palloc and a small memory leak
> within the memory context.
>
> /*
> * Partition sets obtained from mutually-disjunctive clauses are
> * combined using set union.
> */
> or_partset = bms_union(or_partset, arg_partset);

Done.  Replaced with bms_add_members().

> 2. Also in get_partitions_from_clauses_recurse(), it might also be
> worth putting in a bms_free(or_partset) after:
> 
> /*
> * Partition sets obtained from mutually-conjunctive clauses are
> * combined using set intersection.
> */
> result = bms_intersect(result, or_partset);

Done, too.

> Also, instead of using bms_intersect here, would it be better to do:
> 
> result = bms_del_members(result, or_partset); ?
>
> That way you don't do a bms_copy and leak member for each OR branch
> since bms_intersect also does a bms_copy()
>
> The resulting set could end up with a few more trailing 0 words than
> what you have now, but it to be a better idea not allocate a new set
> each time.

You meant bms_int_members(), as you also said in your other email.

On 2017/12/19 14:42, David Rowley wrote:
> I was also wondering about your thoughts on the design of
> get_partitions_for_keys() and more generally how there are many
> functions which have some special treatment doing something based on
> ->strategy == PARTITION_STRATEGY_XXX.
>
> If I do:
>
> git grep PARTITION_STRATEGY -- src/backend/catalog/partition.c | wc -l
>
> I get 62 matches, most of which are case statements, and most of the
> remainder are things like if (key->strategy ==
> PARTITION_STRATEGY_HASH).
>
> git grep --show-function PARTITION_STRATEGY -- src/backend/catalog/
> partition.c
>
> shows that get_partitions_for_keys() is probably the most guilty of
> having the most strategy condition tests.

I notice that too now that you mention it.

>
> Also, if we look at get_partitions_for_keys() there's an unconditional:
>
> memset(hash_isnull, false, sizeof(hash_isnull));
>
> which is only used for PARTITION_STRATEGY_HASH, but LIST and RANGE
> must pay the price of that memset.

Although I know you're talking about something else here (about which I
say below), turns out this hash_isnull was completely unnecessary, so I
got rid of it.

> Perhaps it's not expensive enough
> to warrant only doing that when partkey->strategy ==
> PARTITION_STRATEGY_HASH, but it does make me question if we should
> have 3 separate functions for this and just have a case statement to
> call the correct one.
>
> I think if we were to put this off as something we'll fix later, then
> the job would just become harder and harder as time goes on.
>
> It might have been fine when we just had RANGE and LIST partitioning,
> but I think HASH really tips the scales over to this being needed.
>
> What do you think?

I think I somewhat understand your concern with regard to future additions
and maintenance and also now tend to agree.

I tried dividing up get_partitions_for_keys() into one function each for
hash, list, and range and it looks like in the attached.  I think I like
the result.  Each function has to deal with only query keys and bounds
assuming a given partitioning method and that appears to add to the
overall clarity of the code.

On 2017/12/19 22:44, David Rowley wrote:
> Again, another micro review. I apologise for the slow trickle of
> review. Again, these are just things I'm noticing while reading
> through while thinking of the run-time pruning patch.
>
>
> 1. The following Assert appears to be testing for the presence of
> cosmic rays :-)
>
> /*
> * Determine set of partitions using provided keys, which proceeds in a
> * manner determined by the partitioning method.
> */
> if (keys->n_eqkeys == partkey->partnatts)
> {
> Assert(keys->n_eqkeys == partkey->partnatts);
>
> Perhaps it's misplaced during a rewrite? Should be safe enough to
> remove it, I'd say.

I noticed that too and took care of it. :)

> 2. The following code in classify_partition_bounding_keys() misses
> looking under the RelabelType for rightarg:
>
> leftop = (Expr *) get_leftop(clause);
> if (IsA(leftop, RelabelType))
> leftop = ((RelabelType *) leftop)->arg;
> rightop = (Expr *) get_rightop(clause);
> if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr))
> constexpr = rightop;
> else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr))
> constexpr = leftop;
>
> This breaks the following case:
>
> create table thisthat (a varchar not null) partition by list (a);
> create table this partition of thisthat for values in('this');
> create table that partition of thisthat for values in('that');
> explain select * from thisthat where 'this' = a; -- does not work
>                          QUERY PLAN
> ------------------------------------------------------------
>  Append  (cost=0.00..54.00 rows=14 width=32)
>    ->  Seq Scan on that  (cost=0.00..27.00 rows=7 width=32)
>          Filter: ('this'::text = (a)::text)
>    ->  Seq Scan on this  (cost=0.00..27.00 rows=7 width=32)
>          Filter: ('this'::text = (a)::text)
> (5 rows)
>
> explain select * from thisthat where a = 'this'; -- works as we look
> through the RelabelType on left arg.
>                          QUERY PLAN
> ------------------------------------------------------------
>  Append  (cost=0.00..27.00 rows=7 width=32)
>    ->  Seq Scan on this  (cost=0.00..27.00 rows=7 width=32)
>          Filter: ((a)::text = 'this'::text)

Thanks for pointing it out, fixed.

> 3. The follow code assumes there will be a commutator for the operator:
>
> if (constexpr == rightop)
> pc->op = opclause;
> else
> {
> OpExpr   *commuted;
>
> commuted = (OpExpr *) copyObject(opclause);
> commuted->opno = get_commutator(opclause->opno);
> commuted->opfuncid = get_opcode(commuted->opno);
> commuted->args = list_make2(rightop, leftop);
> pc->op = commuted;
> }
>
> I had to hunt for it, but it appears that you're pre-filtering clauses
> with the Consts on the left and no valid commutator in
> match_clauses_to_partkey. I think it's likely worth putting a comment
> to mention that reversed clauses with no commutator should have been
> filtered out beforehand. I'd say it's also worthy of an Assert().

Yeah, added a comment and an Assert.

>
> 4. The spelling of "arbitrary" is incorrect in:
>
> * partattno == 0 refers to arbirtary expressions, which get the

Fixed.

> 5. I've noticed that partition pruning varies slightly from constraint
> exclusion in the following case:
>
> create table ta (a int not null) partition by list (a);
> create table ta1 partition of ta for values in(1,2);
> create table ta2 partition of ta for values in(3,4);
>
> explain select * from ta where a <> 1 and a <> 2; -- partition ta1 is
> not eliminated.
>                          QUERY PLAN
> -------------------------------------------------------------
>  Append  (cost=0.00..96.50 rows=5050 width=4)
>    ->  Seq Scan on ta1  (cost=0.00..48.25 rows=2525 width=4)
>          Filter: ((a <> 1) AND (a <> 2))
>    ->  Seq Scan on ta2  (cost=0.00..48.25 rows=2525 width=4)
>          Filter: ((a <> 1) AND (a <> 2))
> (5 rows)
>
>
> alter table ta1 add constraint ta1_chk check (a in(1,2)); -- add a
> check constraint to see if can be removed.
> explain select * from ta where a <> 1 and a <> 2; -- it can.
>                          QUERY PLAN
> -------------------------------------------------------------
>  Append  (cost=0.00..48.25 rows=2525 width=4)
>    ->  Seq Scan on ta2  (cost=0.00..48.25 rows=2525 width=4)
>          Filter: ((a <> 1) AND (a <> 2))
> (3 rows)

I see.  It seems that the current approach of handling <> operators by
turning clauses containing the same into (key > const OR key < const)
doesn't always work.  I think I had noticed that for list partitioning at
least.  I will work on alternative way of handling that in the next
version of the patch.

Meanwhile, please find attached patches (v15) that take care of the rest
of the comments.  Most of the updates are to patch 0002, compared to the
last (v14) version.

Thanks again for your thoughtful review comments.

Thanks,
Amit

Вложения

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

21 декабря 2017 г., 16:38:36

On 2017/12/20 17:27, Amit Langote wrote:
> On 2017/12/19 13:36, David Rowley wrote:
>> 5. I've noticed that partition pruning varies slightly from constraint
>> exclusion in the following case:
>>
>> create table ta (a int not null) partition by list (a);
>> create table ta1 partition of ta for values in(1,2);
>> create table ta2 partition of ta for values in(3,4);
>>
>> explain select * from ta where a <> 1 and a <> 2; -- partition ta1 is
>> not eliminated.
>>                          QUERY PLAN
>> -------------------------------------------------------------
>>  Append  (cost=0.00..96.50 rows=5050 width=4)
>>    ->  Seq Scan on ta1  (cost=0.00..48.25 rows=2525 width=4)
>>          Filter: ((a <> 1) AND (a <> 2))
>>    ->  Seq Scan on ta2  (cost=0.00..48.25 rows=2525 width=4)
>>          Filter: ((a <> 1) AND (a <> 2))
>> (5 rows)
>>
>>
>> alter table ta1 add constraint ta1_chk check (a in(1,2)); -- add a
>> check constraint to see if can be removed.
>> explain select * from ta where a <> 1 and a <> 2; -- it can.
>>                          QUERY PLAN
>> -------------------------------------------------------------
>>  Append  (cost=0.00..48.25 rows=2525 width=4)
>>    ->  Seq Scan on ta2  (cost=0.00..48.25 rows=2525 width=4)
>>          Filter: ((a <> 1) AND (a <> 2))
>> (3 rows)
> 
> I see.  It seems that the current approach of handling <> operators by
> turning clauses containing the same into (key > const OR key < const)
> doesn't always work.  I think I had noticed that for list partitioning at
> least.  I will work on alternative way of handling that in the next
> version of the patch.

I think I was able to make this work and in the process of making it work,
also came to the conclusion that this could be made to work sensibly
*only* for list partitioned tables.  That's because one cannot prune a
given partition using a set of <> operator clauses, if we cannot be sure
that those clauses exclude *all* values of the partition key allowed by
that partition.  It's only possible to do that for a list partitioned
table, because by definition one is required to spell out every value that
a given partition of such table allows.

There is a new function in the updated patch that does the pruning using
<> operator clauses and it's implemented by assuming it's only ever called
for a list partitioned table.  So, sorry range and hash partitioned tables.

Attached updated set of patches.

Thanks,
Amit

Hi David.

On 2017/12/22 10:35, David Rowley wrote:
> On 22 December 2017 at 13:57, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Will post the fixed version shortly, thanks.
> 
> I've made another partial pass on the patch and have a few more
> things. #3 and #4 are just comments rather than requests to change
> something. I think we should change those before PG11 though.

Thank you.

> 1. If I look at the elog(ERROR) messages in partition.c, there are a
> number of variations of reporting an invalid partition strategy.
> 
> There seem to be 3 variations of the same thing. Probably the
> "unexpected" one would suit most, but I've not looked too closely.
> 
> elog(ERROR, "invalid partitioning strategy");
> elog(ERROR, "unexpected partition strategy: %d", (int) key->strategy);
> elog(ERROR, "invalid partition strategy %c",
> RelationGetPartitionKey(rel)->strategy);

I should have used the "unexpected ..." wording in the yesterday's update.
 Fixed.

> 2. In get_relation_constraints(). Can you add a comment to why you've added:
> 
> /* Append partition predicates, if any */
> if (root->parse->commandType != CMD_SELECT)
> {
> 
> I guess it must be because we use the new partition pruning code for
> SELECT, but not for anything else.

Yeah, I explained that a couple of times on email (maybe also in the
commit message), but not there.  Done.

> 3. It's a shame that RelOptInfo->live_partitioned_rels is a List and
> not a RelIds. I guess if you were to change that you'd need to also
> change AppendPath->partitioned_rels too, so probably we can just fix
> that later.

I agree.

> 4. live_part_appinfos I think could be a Relids type too, but probably
> we can change that after this patch. Append subpaths are sorted in
> create_append_path() for parallel append, so the order of the subpaths
> seems non-critical.

Hmm, perhaps.

> 5. Small memory leaks in get_partitions_from_clauses_recurse().
> 
> if (ne_clauses)
> result = bms_int_members(result,
> get_partitions_from_ne_clauses(relation,
> ne_clauses));
> 
> Can you assign the result of get_partitions_from_ne_clauses() and
> bms_free() it after the bms_int_members() ?
> 
> Same for:
> 
> result = bms_int_members(result,
> get_partitions_from_or_clause_args(relation,
> rt_index,
> or->args));
> 
> The reason I'm being particular about this is that for the run-time
> pruning patch we'll call this from ExecReScanAppend() which will
> allocate into the ExecutorState which lives as long as the query does.
> So any leaks will last the entire length of the query.
> ExecReScanAppend() could be called millions of billions of times, so
> we need to be sure that's not going to be a problem.

That's a very important point to stress.  Thanks.

> 6. Similar to #5, memory leaks in get_partitions_from_or_clause_args()
> 
> arg_partset = get_partitions_from_clauses_recurse(relation, rt_index,
>   arg_clauses);
> 
> /*
> * Partition sets obtained from mutually-disjunctive clauses are
> * combined using set union.
> */
> result = bms_add_members(result, arg_partset);
> 
> Need to bms_free(arg_partset)

Fixed all these instances of leaks.

> Running out of time for today, but will look again in about 4 days.

Thanks again.

Please find attached updated patches.

Thanks,
Amit

I happened to notice that Ashutosh's patch series at
https://www.postgresql.org/message-id/CAFjFpReJhFSoy6DqH0ipFSHd=sLNEkSzAtz4VWCaS-w2jZL=uw@mail.gmail.com
has a 0001 patch that modifies the partition_bound_cmp stuff too.
Are those conflicting?

Ashutosh's commit message:
Modify bound comparision functions to accept members of PartitionKey

Functions partition_bound_cmp(), partition_rbound_cmp() and
partition_rbound_datum_cmp() are required to merge partition bounds
from joining relations. While doing so, we do not have access to the
PartitionKey of either relations. So, modify these functions to accept
only required members of PartitionKey so that the functions can be
reused for merging bounds.

Amit's:
Some interface changes for partition_bound_{cmp/bsearch}

Introduces a notion of PartitionBoundCmpArg, which replaces the set
of arguments void *probe and bool probe_is_bound of both
partition_bound_cmp and partition_bound_bsearch. It wasn't possible
before to specify the number of datums when a non-bound type of
probe is passed. Slightly tweaking the existing interface to allow
specifying the same seems awkward. So, instead encapsulate that
into PartitionBoundCmpArg. Also, modify partition_rbound_datum_cmp
to compare caller-specifed number of datums, instead of
key->partnatts datums.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Ashutosh Bapat

Дата:

04 января 2018 г., 20:29:30

On Fri, Dec 29, 2017 at 6:32 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I happened to notice that Ashutosh's patch series at
> https://www.postgresql.org/message-id/CAFjFpReJhFSoy6DqH0ipFSHd=sLNEkSzAtz4VWCaS-w2jZL=uw@mail.gmail.com
> has a 0001 patch that modifies the partition_bound_cmp stuff too.
> Are those conflicting?
>
> Ashutosh's commit message:
>         Modify bound comparision functions to accept members of PartitionKey
>
>         Functions partition_bound_cmp(), partition_rbound_cmp() and
>         partition_rbound_datum_cmp() are required to merge partition bounds
>         from joining relations. While doing so, we do not have access to the
>         PartitionKey of either relations. So, modify these functions to accept
>         only required members of PartitionKey so that the functions can be
>         reused for merging bounds.
>
> Amit's:
>         Some interface changes for partition_bound_{cmp/bsearch}
>
>         Introduces a notion of PartitionBoundCmpArg, which replaces the set
>         of arguments void *probe and bool probe_is_bound of both
>         partition_bound_cmp and partition_bound_bsearch.  It wasn't possible
>         before to specify the number of datums when a non-bound type of
>         probe is passed.  Slightly tweaking the existing interface to allow
>         specifying the same seems awkward.  So, instead encapsulate that
>         into PartitionBoundCmpArg.  Also, modify partition_rbound_datum_cmp
>         to compare caller-specifed number of datums, instead of
>         key->partnatts datums.
>

I haven't looked at Amit's changes, but we need a more flexible way to
pass information required for datum comparison than using
PartitionKey, since that's not available in the optimizer and can not
be associated with join, aggregate relations. If we pass that
information through a structure, there are two ways
1. it will need to be part of PartitionScheme; I am not sure if we can
have a substructure in PartitionKey. But if we can do it that way, we
can pass that structure to the functions.
2. we will need to construct the structure filling it with comparison
information and pass it to the comparison functions. I think what we
achieve out of this isn't worth the code we will need to add.

I would prefer first approach over the other.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

04 января 2018 г., 22:28:23

> From 8d627b910278203151853d324c3319c265cd36c0 Mon Sep 17 00:00:00 2001
> From: amit <amitlangote09@gmail.com>
> Date: Tue, 22 Aug 2017 13:48:13 +0900
> Subject: [PATCH 2/5] Introduce a get_partitions_from_clauses()

This one fails to apply.  Please rebase.

Did you know you can use "git format-patch -v6" to generate
appropriately named patch files without having to rename them yourself?


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Jesper Pedersen

Дата:

05 января 2018 г., 00:16:11

Hi Amit,

On 12/21/2017 11:25 PM, Amit Langote wrote:
> Thanks again.
> 
> Please find attached updated patches.
> 

I have been looking at this patch from a simple hash partition point of 
view.

-- ddl.sql --
CREATE TABLE t1 (
     a integer NOT NULL,
     b integer NOT NULL
) PARTITION BY HASH (b);

CREATE TABLE t1_p00 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 0);
CREATE TABLE t1_p01 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 1);
CREATE TABLE t1_p02 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 2);
CREATE TABLE t1_p03 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 3);

CREATE INDEX idx_t1_b_a_p00 ON t1_p00 USING btree (b, a);
CREATE INDEX idx_t1_b_a_p01 ON t1_p01 USING btree (b, a);
CREATE INDEX idx_t1_b_a_p02 ON t1_p02 USING btree (b, a);
CREATE INDEX idx_t1_b_a_p03 ON t1_p03 USING btree (b, a);

INSERT INTO t1 (SELECT i, i FROM generate_series(1, 1000000) AS i);

ANALYZE;
-- ddl.sql --

w/

-- select.sql --
\set b random(1, 1000000)
BEGIN;
SELECT t1.a, t1.b FROM t1 WHERE t1.b = :b;
COMMIT;
-- select.sql --

using pgbench -c X -j X -M prepared -T X -f select.sql part-hash

On master we have generic_cost planning cost of 33.75, and an 
avg_custom_cost of 51.25 resulting in use of the generic plan and a TPS 
of 8893.

Using v17 we have generic_cost planning cost of 33.75, and an 
avg_custom_cost of 25.9375 resulting in use of the custom plan and a TPS 
of 7129 - of course due to the generation of a custom plan for each 
invocation.

Comparing master with an non-partitioned scenario; we have a TPS of 
12968, since there is no overhead of ExecInitAppend (PortalStart) and 
ExecAppend (PortalRun).

Could you share your thoughts on

1) if the generic plan mechanics should know about the pruning and hence 
give a lower planner cost

1) if the patch should be more aggressive in removing planning nodes 
that aren't necessary, e.g. going from Append -> IndexOnly to just 
IndexOnly.

I have tested with both [1] and [2], but would like to know about your 
thoughts on the above first.

Thanks in advance !

[1] https://commitfest.postgresql.org/16/1330/
[2] https://commitfest.postgresql.org/16/1353/

Best regards,
  Jesper

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

05 января 2018 г., 07:13:39

On 5 January 2018 at 07:16, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> 1) if the patch should be more aggressive in removing planning nodes that
> aren't necessary, e.g. going from Append -> IndexOnly to just IndexOnly.

That's not for this patch. There's another patch [1] to do that already.

[1] https://commitfest.postgresql.org/16/1353/


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

05 января 2018 г., 08:21:47

On 5 January 2018 at 07:16, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> \set b random(1, 1000000)
> BEGIN;
> SELECT t1.a, t1.b FROM t1 WHERE t1.b = :b;
> COMMIT;
> -- select.sql --
>
> using pgbench -c X -j X -M prepared -T X -f select.sql part-hash
>
> On master we have generic_cost planning cost of 33.75, and an
> avg_custom_cost of 51.25 resulting in use of the generic plan and a TPS of
> 8893.
>
> Using v17 we have generic_cost planning cost of 33.75, and an
> avg_custom_cost of 25.9375 resulting in use of the custom plan and a TPS of
> 7129 - of course due to the generation of a custom plan for each invocation.
>
> Comparing master with an non-partitioned scenario; we have a TPS of 12968,
> since there is no overhead of ExecInitAppend (PortalStart) and ExecAppend
> (PortalRun).
>
> Could you share your thoughts on
>
> 1) if the generic plan mechanics should know about the pruning and hence
> give a lower planner cost

I think the problem here is that cached_plan_cost() is costing the
planning cost of the query too low. If this was costed higher then its
more likely the generic plan would have been chosen, instead of
generating a custom plan each time.

How well does it perform if you change cpu_operator_cost = 0.01?

I think cached_plan_cost() does need an overhaul, but I think it's not
anything that should be done as part of this patch. You've picked HASH
partitioning here just because the current master does not perform any
partition pruning for that partitioning strategy.

There also might be a tiny argument here to have some method of
disabling the planner's partition pruning as we could before with SET
constraint_exclusion = 'off', but I think that's about the limit of
the interest this patch should have in that problem.

(The problem gets more complex again when doing run-time pruning, but
that's not a topic for this thread)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Jesper Pedersen

Дата:

06 января 2018 г., 00:04:08

Hi David,

On 01/04/2018 09:21 PM, David Rowley wrote:
> On 5 January 2018 at 07:16, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>> Could you share your thoughts on
>>
>> 1) if the generic plan mechanics should know about the pruning and hence
>> give a lower planner cost
> 
> I think the problem here is that cached_plan_cost() is costing the
> planning cost of the query too low. If this was costed higher then its
> more likely the generic plan would have been chosen, instead of
> generating a custom plan each time.
> 
> How well does it perform if you change cpu_operator_cost = 0.01?
> 

It gives 38.82 for generic_cost, and 108.82 for avg_custom_cost on 
master (8249 TPS). And, 38.82 for generic_cost, and 79.705 for 
avg_custom_cost with v17 (7891 TPS). Non-partitioned is 11722 TPS.

> I think cached_plan_cost() does need an overhaul, but I think it's not
> anything that should be done as part of this patch. You've picked HASH
> partitioning here just because the current master does not perform any
> partition pruning for that partitioning strategy.
>

Well, I mainly picked HASH because that is my use-case :)

For a range based setup it gives 39.84 for generic_cost, and 89.705 for 
avg_custom_cost (7862 TPS).

Best regards,
  Jesper

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

09 января 2018 г., 14:40:08

On 2018/01/05 1:28, Alvaro Herrera wrote:
>> From 8d627b910278203151853d324c3319c265cd36c0 Mon Sep 17 00:00:00 2001
>> From: amit <amitlangote09@gmail.com>
>> Date: Tue, 22 Aug 2017 13:48:13 +0900
>> Subject: [PATCH 2/5] Introduce a get_partitions_from_clauses()
> 
> This one fails to apply.  Please rebase.

Sorry about the absence in the last few days.  I will post a new version
addressing various review comments by the end of this week.

> Did you know you can use "git format-patch -v6" to generate
> appropriately named patch files without having to rename them yourself?

Oh, didn't know that trick.  Will try, thanks.

Regards,
Amit

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

10 января 2018 г., 04:35:49

On 9 January 2018 at 21:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Sorry about the absence in the last few days.  I will post a new version
> addressing various review comments by the end of this week.

Good to have you back.

There's a small problem with get_partitions_for_keys_list(), the
following case hits the Assert(false) at the bottom of that function.

create table ab_c (a int not null, b char) partition by list(a);
create table abc_a2 (b char, a int not null) partition by list(b);
create table abc_a2_b3 partition of abc_a2 for values in ('3');
alter table ab_c attach partition abc_a2 for values in (2);

select * from ab_c where a between 1 and 2 and b <= '2';

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

10 января 2018 г., 07:55:28

On 9 January 2018 at 21:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Sorry about the absence in the last few days.  I will post a new version
> addressing various review comments by the end of this week.

One more thing I discovered while troubleshooting a bug Beena reported
in the run-time partition pruning patch is that
classify_partition_bounding_keys properly does;

if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr))
    constexpr = rightop;
else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr))
    constexpr = leftop;
else
    /* Clause not meant for this column. */
    continue;

for OpExpr clauses, but does not do the same for leftop for the
ScalarArrayOpExpr test.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

10 января 2018 г., 10:18:58

On 9 January 2018 at 21:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Sorry about the absence in the last few days.  I will post a new version
> addressing various review comments by the end of this week.

I'm sorry for the flood of emails today.

Beena's tests on the run-time partition pruning patch also indirectly
exposed a problem with this patch.

Basically, the changes to add_paths_to_append_rel() are causing
duplication in partition_rels.

A test case is:

create table part (a int, b int) partition by list(a);
create table part1 partition of part for values in(1) partition by list (b);
create table part2 partition of part1 for values in(1);

select * from part;

partition_rels ends up with 3 items in the list, but there's only 2
partitions here. The reason for this is that, since planning here is
recursively calling add_paths_to_append_rel, the list for part ends up
with itself and part1 in it, then since part1's list already contains
itself, per set_append_rel_size's "rel->live_partitioned_rels =
list_make1_int(rti);", then part1 ends up in the list twice.

It would be nicer if you could use a RelIds for this, but you'd also
need some way to store the target partition relation since
nodeModifyTable.c does:

/* The root table RT index is at the head of the partitioned_rels list */
if (node->partitioned_rels)
{
    Index root_rti;
    Oid root_oid;

    root_rti = linitial_int(node->partitioned_rels);
    root_oid = getrelid(root_rti, estate->es_range_table);
    rel = heap_open(root_oid, NoLock); /* locked by InitPlan */
}

You could also fix it by instead of doing:

/*
* Accumulate the live partitioned children of this child, if it's
* itself partitioned rel.
*/
if (childrel->part_scheme)
    partitioned_rels = list_concat(partitioned_rels,
       childrel->live_partitioned_rels);

do something along the lines of:

if (childrel->part_scheme)
{
    ListCell *lc;
    ListCell *start = lnext(list_head(childrel->live_partitioned_rels));

    for_each_cell(lc, start)
       partitioned_rels = lappend_int(partitioned_rels,
       lfirst_int(lc));
}

Although it seems pretty fragile. It would probably be better to find
a nicer way of handling all this.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [Sender Address Forgery]Re: [HACKERS] path toward fasterpartition pruning

От

Amit Langote

Дата:

11 января 2018 г., 15:52:09

Hi David.

Thanks a lot of the review.  I'm replying to your multiple emails here.

On 2017/12/28 11:07, David Rowley wrote:
> I've just completed a pass over the v17 patch set. I've found a number
> of things that need to be addressed. Some might seem a bit nit-picky,
> sorry about that. However, many of the others genuinely need to be
> addressed.
> 
> 1. The following code calls RelationGetPartitionQual, but the output
> of that is only needed when partition_bound_has_default(boundinfo) is
> true.
> 
> Can you change this to only get the RelationGetPartitionQual when it's required?

OK, done that way.

> 2. The header comment in match_clauses_to_partkey() does not give any
> warning that 'clauses' is modified within the function.
> 
> The function should create a copy of the clauses before modifying
> them. This will save you having to do any list_copy calls when you're
> calling the function.

OK, added a list_copy on clauses at the beginning of the function.

> 
> The header comment is also not very clear about what the return value
> of the function is.

Fixed the comment to describe return values.

> 3. "method" I think should be "strategy". We've pretty much
> standardised on that term everywhere else, so let's keep to standard.
> 
> /*
> * Does the query specify a key to be null or not null?  Partitioning
> * handles null partition keys specially depending on the partitioning
> * method in use, we store this information.
> */

Fixed.

> 4. "relation" should be in single quotes, since you're talking about
> the parameter named "relation". Likewise with "partclauses", otherwise
> it just seems like bad English.
> 
>  * Determine the set of partitions of relation that will satisfy all
>  * the clauses contained in partclauses

Fixed.

> 5. partdesc's assignment can be delayed until it's needed. This will
> save generating it when constfalse == true
> 
> static Bitmapset *
> get_partitions_from_clauses_recurse(Relation relation, int rt_index,
> List *clauses)
> {
> PartitionDesc partdesc = RelationGetPartitionDesc(relation);

Moved the whole line to a tiny block where partdesc is used.

> 6. In the following comment, I'd have expected almost identical code,
> just with some other List, but the code probably differs a bit too
> much to use "Ditto".
> 
> /*
> * Ditto, but this time or_clauses.
> */

Hmm, OK. I tried rewriting the comment a bit.

> 
> 7. Comment claims we use "set union", but we're really just collecting
> the members from the other set in:
> 
> /*
> * Partition sets obtained from mutually-disjunctive clauses are
> * combined using set union.
> */
> result = bms_add_members(result, arg_partset);

Guess the comment doesn't really add much in this case, so just deleted it.

> 8. These arrays could just be initialized up to partkey->partnatts.
> I'd imagine most of the time this will be just 1, so would save
> needlessly setting the 31 other elements, although, perhaps it's a bad
> idea to optimize this.
> 
> memset(keyclauses_all, 0, sizeof(keyclauses_all));
> /* false means we don't know if a given key is null */
> memset(keyisnull, false, sizeof(keyisnull));
> /* false means we don't know if a given key is not null */
> memset(keyisnotnull, false, sizeof(keyisnull));
> 
> The last two of these could just be Bitmapsets and you'd not need any
> memset at all. PARTITION_MAX_KEYS just so happens to be the same as
> BITS_PER_BITMAPWORD in a standard build, so you'd always be able to
> mark everything with a single bitmap word. This would help a bit in
> the various places that you're counting the true elements, for
> example, the following code:
> 
> for (i = 0; i < partkey->partnatts; i++)
> {
> if (!keys->keyisnotnull[i])
> {
> include_def = true;
> break;
> }
> }
> 
> could become:
> 
> include_def = !bms_is_empty(keys->keyisnotnull);
> 
> if you converted all these to Bitmapsets.

I liked the idea of using Bitmapset for keyisnull and keyisnotnull, so
implemented it.

> 9. The following comment would be better read as: /* clause does not
> match this partition key */
> 
> /* Clause not meant for this column. */
> 
> 10. The following comment talks about handling less than operators for
> hash opfamilies, but the code only handles <> for btree and list
> partitioning.
> 
> * Handle some cases wherein the clause's operator may not
> * belong to the partitioning operator family.  For example,
> * operators named '<>' are not listed in any operator
> * family whatsoever.  Also, ordering opertors like '<' are
> * not listed in the hash operator family.
> 
> "opertors" should be spelled "operators"

Rewrote the comments here a bit.

> 11. In the following comment "operator" should be "operators":
> 
>  * are constrained by clauses containing equality operator, unless hash
> 
> Likewise for:
> 
>  * IS NULL clauses while remaining have clauses with equality operator.

Fixed.

> 12. The following code in classify_partition_bounding_keys probably
> should use EXPR_MATCHES_PARTKEY.
> 
> /* Does leftop match with this partition key column? */
> if ((IsA(arg, Var) &&
> ((Var *) arg)->varattno == partattno) ||
> equal(arg, partexpr))

Done.

> 13. The comment for classify_partition_bounding_keys does not seem to
> define well what the meaning of the return value is:
> 
>  * classify_partition_bounding_keys
>  * Classify partition clauses into equal, min, and max keys, along with
>  * any Nullness constraints and return that information in the output
>  * argument keys (number of keys is the return value)
> 
> I had thought all along that this must mean the number of distinct
> partition keys that we've found useful clauses for, but it's possible
> to fool this with duplicate IS NULL checks.
> 
> create table hp (a int, b text) partition by hash (a, b);
> create table hp0 partition of hp for values with (modulus 4, remainder 0);
> create table hp3 partition of hp for values with (modulus 4, remainder 3);
> create table hp1 partition of hp for values with (modulus 4, remainder 1);
> create table hp2 partition of hp for values with (modulus 4, remainder 2);
> 
> explain select * from hp where a is null and a is null;
> 
> This causes classify_partition_bounding_keys to return 2. I can't see
> any bad side effects of this, but I think the comment needs to be
> improved. Perhaps it's better just to make the function return bool,
> returning true if there are any useful keys?

OK, I made classify_partition_bounding_keys() return a bool instead.

> 14. The switch statement in partition_op_strategy could be simplified
> a bit and be written more as:
> 
> case BTLessEqualStrategyNumber:
>   *incl = true;
>   /* fall through */
> case BTLessStrategyNumber:
>   result = PART_OP_LESS;
>   break;

That's better, done.

> 15. No default case statements in partition_op_strategy() could result
> in "result" not being set.

I added a default case, but it's simply an elog(ERROR, ...).

> 16. I'm unsure what the following comment means:
> 
> * To set eqkeys, we must have found the same for partition key columns.
> 
> The word "for" seems wrong, or I'm not sure what it wants to ensure it
> finds the same for.

I rewrote this comment a bit, since like you, I too am no longer able to
make sense of it.  Just wanted to say here that to set keys->eqkeys at
all, we must have found matching clauses containing equality operators for
all partition keys columns.

> 17. Header comment for partition_op_strategy is out-of-date.
> 
> /*
>  * Returns -1, 0, or 1 to signify that the partitioning clause has a </<=,
>  * =, and >/>= operator, respectively.  Sets *incl to true if equality is
>  * implied.
>  */
> 
> It'll never return -1.

Oops, fixed.

> 18. The following comment has some grammar issues:
> 
> /*
> * If couldn't coerce to the partition key type, that is, the type of
> * datums stored in PartitionBoundInfo, no hope of using this
> * expression for anything partitioning-related.
> */
> 
> Would be better with:
> 
> /*
> * If we couldn't coerce the partition key type, that is, the type
> * of datums stored in PartitionBoundInfo, then there's no hope of
> * using this expression for anything partitioning-related.
> */

OK, done.

> 19. In partkey_datum_from_expr() the Assert(false) at the end seems
> overkill. You could just get rid of the default in the switch
> statement and have it fall through to the final return. This would
> save 4 lines of code.

Actually, let's just get rid of the switch.  If I remove the default case
like you suggest, there are NodeTag enum values not handled warnings.

> 20. The word "and" I think needs removed from the following comment:
> 
> * Couldn't compare; keep hash_clause set to the previous value and
> * so add this one directly to the result.  Caller would

Actually, it seems better to keep the "and" and remove the "so".

> Probably also needs a comment after "value"

Done.

> 21. The following comment seems to indicate 'cur' is an operator, but
> it's a PartClause:
> 
> /* The code below is for btree operators, which cur is not. */
> 
> It might be better to write
> 
> /* The code below handles Btree operators which are not relevant to a
> hash-partitioned table. */

Agreed, done.

> 22. "Stuff" should be "The code" in:
> 
> * Stuff that follows closely mimics similar processing done by

Done.

> 23. In remove_redundant_clauses() and various other places, you have a
> variable named partattoff. What's the meaning of this name? I'd
> imagine it is short for "partition attribute offset", but it's a
> "partition key index", so why not "partkeyidx"?
>
> Of course, you might claim that an array index is just an offset, but
> it is a little confusing if you think of attribute offsets in a
> TupleDesc.

Hmm, OK.  Replaced partattoff with partkeyidx.

> 24. get_partitions_for_keys() you're using "return" and "break" in the
> switch statement. You can remove the breaks;

Oops, fixed.

> 25. The Assert(false) in get_partitions_for_keys seems overkill. I
> think it's fine to just do:
> 
> return NULL; /* keep compiler quiet */

Done, too.

> 26. The following code comment in get_partitions_for_keys_hash() does
> not seem very well written:
> 
> * Hash partitioning handles puts nulls into a normal partition and
> * doesn't require to define a special null-accpting partition.
> * Caller didn't count nulls as a valid key; do so ourselves.
> 
> Maybe "puts" should be "storing"?
> 
> Also, does hash partitioning actually support a NULL partition? This
> seems to say it doesn't require, but as far as I can see it does not
> *support* a NULL partition. The comment is a bit misleading.
> "accpting" should be "accepting".

I rewrote that comment.  I wanted to say that, unlike range and list
partitioning which have special handling for null values (there is only
one list/range partition at any given time that could contain nulls in the
partition key), hash partitioning does not.  All hash partitions could
contain nulls in one or more of the partition keys and hence we must
consider nulls as regular equality keys.

> 27. In get_partitions_for_keys_hash() why is it possible to get a
> result_index below 0? In that case, the code will end up triggering
> the Assert(false), but if you want to Assert something here then maybe
> it should be Assert(result_index >= 0)?
> 
> if (result_index >= 0)
> return bms_make_singleton(result_index);

I too used to think that it's impossible to get result_index < 0, but it's
actually possible, because not all required hash partitions may have been
defined yet:

create table hashp (a int) partition by hash (a);
create table hashp0 partition of hashp for values with (modulus 4,
remainder 0);

If user only defines one partition like shown above, then there are no
partitions for when the remainders for a given partition key turns out to
be 1, 2, or 3.

create table hashp (a int) partition by hash (a);
create table hashp0 partition of hashp for values with (modulus 4,
remainder 0);

insert into hashp values (1);
INSERT 0 1
explain select * from hashp where a = 1;
                          QUERY PLAN
--------------------------------------------------------------
 Append  (cost=0.00..41.88 rows=13 width=4)
   ->  Seq Scan on hashp0  (cost=0.00..41.88 rows=13 width=4)
         Filter: (a = 1)
(3 rows)

insert into hashp values (2);
ERROR:  no partition of relation "hashp" found for row
DETAIL:  Partition key of the failing row contains (a) = (2).

explain select * from hashp where a = 2;
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

I guess I forgot to remove that Assert(false) when fixing the code after
realizing that it's indeed possible to not get a hash partition for some keys.

Removed the Assert.

> 28. Is there any point in the following loop in get_partitions_for_keys_list()?
> 
> /*
> * We might be able to get the answer sooner based on the nullness of
> * keys, so get that out of the way.
> */
> for (i = 0; i < partkey->partnatts; i++)
> {
> 
> Won't partkey->partnatts always be 1? Maybe you can Assert that
> instead, just in case someone forgets to change the code if LIST is to
> support multiple partition keys.

I guess that's a leftover from when all partition strategies were handled
in one function.  Got rid of the loop and added the Assert.

> 
> 29. In get_partitions_for_keys_list and get_partitions_for_keys_range
> partdesc is only used for an Assert. Maybe you can just:
> 
> Assert(RelationGetPartitionDesc(rel)->nparts > 0);

OK, done.

> 30. The Assert(false) in get_partitions_for_keys_list() looks
> suspiciously easy to hit...
> 
> create table lp (a int not null) partition by list(a);
> create table lp1 partition of lp for values in(1);
> explain select * from lp where a > 11; -- Assert fail!
> 
> 31. ranget?
> 
>  * get_partitions_for_keys_range
>  * Return partitions of a ranget partitioned table for requested keys

The Assert indeed was easy to hit.  Removed.

> 32. I'm not quite sure what this comment is saying:
> 
> /*
> * eqoff is gives us the bound that is known to be <= eqkeys,
> * given how partition_bound_bsearch works.  The bound at eqoff+1,
> * then, would be the upper bound of the only partition that needs
> * to be scanned.
> */

Rewrote the comment as:

      /*
       * The bound at eqoff is known to be <= eqkeys, given the way
       * partition_bound_bsearch works.  Considering the same as the lower
       * bound of the partition that eqkeys falls into, the bound at
       * eqoff + 1 would be its upper bound, so use eqoff + 1 to get the
       * desired partition's index.
       */

Any better?

> 33. "one" -> "the one"
> 
> * Find the leftmost bound that satisfies the query, i.e., one that
> * satisfies minkeys.

Fixed.

> 34. Why "will" there be multiple partitions sharing the same prefix?
> 
> * If only a prefix of the whole partition key is provided, there will
> * be multiple partitions whose bound share the same prefix.  If minkey
> 
> Perhaps "will" should be "may"?

I guess you're right.

> 35. I think "would've" in the following comment is not correct. This
> seems to indicate that this has already happened, which it has not.
> "will have to be" might be more correct?
>
> * satisfy the query, but don't have a valid partition assigned.  The
> * default partition would've been included to cover those values.

"will have to be included" sounds correct to me too.

> 36. "will" -> "with"?
> 
> * Since partition keys will nulls are mapped to default range

Fixed.

> 37. "If no" -> "If there's no"
> 
> * If no tuple datum to compare with the bound, consider
> * the latter to be greater.



> 
> 38. I don't see anything about setting keynullness here:
> 
> /*
> * Get the clauses that match the partition key, including information
> * about any nullness tests against partition keys.  Set keynullness to
> * a invalid value of NullTestType, which 0 is not.
> */
> partclauses = match_clauses_to_partkey(root, rel,
>    list_copy(rel->baserestrictinfo),
>    &contains_const,
>    &constfalse);

I guess the comment is too ancient.  Fixed.

> 
> 39. "paritions" -> "partitions"
> 
> * Else there are no clauses that are useful to prune any paritions,

Fixed.

> 40. In match_clauses_to_partkey, the following code pulls the varnos
> from each operand of the expression. It would be better to just pull
> the side that's needed (if any) a bit later rather than always doing
> both.
> 
> left_relids = pull_varnos((Node *) leftop);
> right_relids = pull_varnos((Node *) rightop);

Ah, done.

> 41. I think a list_copy is missing here:
> 
> /*
> * Accumulate the live partitioned children of this child, if it's
> * itself partitioned rel.
> */
> if (childrel->part_scheme)
> partitioned_rels = list_concat(partitioned_rels,
>    childrel->live_partitioned_rels);
> 
> Surely if we don't perform a list_copy of
> childrel->live_partitioned_rels then subsequent additions to the
> resulting list will inadvertently add new items to the
> childrel->live_partitioned_rels?

Yeah, I think what's in the patch now is clearly hazardous.  Added a
list_copy on childrel->live_partitioned_rels.

> 42. Is this because of an existing bug?
> 
> @@ -1906,11 +1904,13 @@ explain (costs off) select * from mcrparted
> where abs(b) = 5; -- scans all parti
>           Filter: (abs(b) = 5)
>     ->  Seq Scan on mcrparted3
>           Filter: (abs(b) = 5)
> +   ->  Seq Scan on mcrparted4
> +         Filter: (abs(b) = 5)
>     ->  Seq Scan on mcrparted5
>           Filter: (abs(b) = 5)

Actually, to constraint exclusion's eyes, mcrparted4's partition
constraint is refuted by abs(b) = 5.

With the new patch, we simply do not perform any partition pruning for
that clause, because we do not have a constraint on an earlier partition
key column.

> 43. In partition_prune.sql you have a mix of /* */ and -- comments.
> Please just use --

Oops, fixed.

On 2017/12/29 15:47, David Rowley wrote:
> Just a few extras that I found:
>
> 44. In match_clauses_to_partkey you're making use of
> estimate_expression_value(), I don't think this is safe.
>
> if (IsA(estimate_expression_value(root, rightop), Const))
> *contains_const = true;
>
> The only other places I see using this in the planner are for costing
> purposes. Also, the header comment for that function says it's not
> safe. Particularly "This effectively means that we plan using the
> first supplied value of the Param.". If that's the case, then if we're
> planning a generic plan, then wouldn't it be possible that the planner
> chooses the current supplied parameter value and prune away partitions
> based on that value. That would make the plan invalid for any other
> parameter, but it's meant to be a generic plan, so we can't do that.

You might be right.  Perhaps, I was thinking of eval_const_expressions()
there, which I guess should be enough for this purpose.

>
> 45. Why use a list_copy() here?
>
> /*
> * For a nested ArrayExpr, we don't know how to get the
> * actual scalar values out into a flat list, so we give
> * up doing anything with this ScalarArrayOpExpr.
> */
> if (arrexpr->multidims)
> continue;
>
> elem_exprs = list_copy(arrexpr->elements);

I guess I was just being paranoid there.  No need, so removed.

On 2018/01/10 7:35, David Rowley wrote:
> There's a small problem with get_partitions_for_keys_list(), the
> following case hits the Assert(false) at the bottom of that function.
>
> create table ab_c (a int not null, b char) partition by list(a);
> create table abc_a2 (b char, a int not null) partition by list(b);
> create table abc_a2_b3 partition of abc_a2 for values in ('3');
> alter table ab_c attach partition abc_a2 for values in (2);
>
> select * from ab_c where a between 1 and 2 and b <= '2';

I removed that Assert per your comment above (in an earlier email of yours).

On 2018/01/10 10:55, David Rowley wrote:
> One more thing I discovered while troubleshooting a bug Beena reported
> in the run-time partition pruning patch is that
> classify_partition_bounding_keys properly does;
>
> if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr))
>     constexpr = rightop;
> else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr))
>     constexpr = leftop;
> else
>     /* Clause not meant for this column. */
>     continue;
>
> for OpExpr clauses, but does not do the same for leftop for the
> ScalarArrayOpExpr test.

I'm not sure why we'd need to do that?  Does the syntax of clauses that
use a ScalarArrayOpExpr() allow them to have the partition key on RHS?
Can you point me to the email where Beena reported the problem in question?

On 2018/01/10 13:18, David Rowley wrote:
> Beena's tests on the run-time partition pruning patch also indirectly
> exposed a problem with this patch.
>
> Basically, the changes to add_paths_to_append_rel() are causing
> duplication in partition_rels.
>
> A test case is:
>
> create table part (a int, b int) partition by list(a);
> create table part1 partition of part for values in(1) partition by list (b);
> create table part2 partition of part1 for values in(1);
>
> select * from part;
>
> partition_rels ends up with 3 items in the list, but there's only 2
> partitions here. The reason for this is that, since planning here is
> recursively calling add_paths_to_append_rel, the list for part ends up
> with itself and part1 in it, then since part1's list already contains
> itself, per set_append_rel_size's "rel->live_partitioned_rels =
> list_make1_int(rti);", then part1 ends up in the list twice.

It seems that I found the problem.  Currently, the set_append_rel_size()
step already accumulates the full list in the root partitioned table's
rel->live_partitioned_rels, that is, the list of RT indexes of *all*
partitioned relations in the tree.  Then, when add_paths_to_append_rel()
tries to accumulate child rel's live_partitioned_rels into the parent's,
duplication occurs, because the latter already contains all the entries as
compiled by the earlier step.  I think having only the latter do the
accumulation is better, because even partition-wise join code needs this
facility and it only ever calls add_paths_to_append_rel().


Attached updated version of the patch set containing fixes for almost all
the things mentioned above.


Thanks again for your thoughtful review comments and sorry that I couldn't
reply sooner.

Thanks,
Amit

On 2018/01/11 19:23, David Rowley wrote:
> On 11 January 2018 at 22:52, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2018/01/10 10:55, David Rowley wrote:
>>> One more thing I discovered while troubleshooting a bug Beena reported
>>> in the run-time partition pruning patch is that
>>> classify_partition_bounding_keys properly does;
>>>
>>> if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr))
>>>     constexpr = rightop;
>>> else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr))
>>>     constexpr = leftop;
>>> else
>>>     /* Clause not meant for this column. */
>>>     continue;
>>>
>>> for OpExpr clauses, but does not do the same for leftop for the
>>> ScalarArrayOpExpr test.
>>
>> I'm not sure why we'd need to do that?  Does the syntax of clauses that
>> use a ScalarArrayOpExpr() allow them to have the partition key on RHS?
> 
> No, but there's no test to ensure the leftop matches the partition key.
> 
> There's just:
> 
> ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) clause;
> Oid saop_op = saop->opno;
> Oid saop_opfuncid = saop->opfuncid;
> Oid saop_coll = saop->inputcollid;
> Node   *leftop = (Node *) linitial(saop->args),
>    *rightop = (Node *) lsecond(saop->args);
> List   *elem_exprs,
>    *elem_clauses;
> ListCell *lc1;
> bool negated = false;
> 
> /*
> * In case of NOT IN (..), we get a '<>', which while not
> * listed as part of any operator family, we are able to
> * handle the same if its negator is indeed a part of the
> * partitioning operator family.
> */
> if (!op_in_opfamily(saop_op, partopfamily))
> {
> Oid negator = get_negator(saop_op);
> int strategy;
> Oid lefttype,
> righttype;
> 
> 
> if (!OidIsValid(negator))
> continue;
> get_op_opfamily_properties(negator, partopfamily, false,
>    &strategy,
>    &lefttype, &righttype);
> if (strategy == BTEqualStrategyNumber)
> negated = true;
> }
> 
> Since there's nothing to reject the clause that does not match the
> partition key, the IN's left operand might be of any random type, and
> may well not be in partopfamily, so when it comes to looking up
> get_op_opfamily_properties() you'll hit: elog(ERROR, "operator %u is
> not a member of opfamily %u", opno, opfamily);

Ah, I completely missed that.  So we need something like the following in
this IsA(clause, ScalarArrayOpExpr) block:


+                /* Clause does not match this partition key. */
+                if (!EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr))
+                    continue;
+

> Still looking at the v17 patch here, but I also don't see a test to
> see if the IsBooleanOpfamily(partopfamily) is checking it matches the
> partition key.

You're right.  Added checks there as well.

>> Can you point me to the email where Beena reported the problem in question?
> 
> https://www.postgresql.org/message-id/CAOG9ApERiop7P=GRkqQKa82AuBKjxN3qVixie3WK4WqQpEjS6g@mail.gmail.com
>
> To save you from having to look at the run-time prune patch, here's
> case that break in v18.
>
> create table xy (a int, b text) partition by range (a,b);
> create table xy1 partition of xy for values from (0,'a') to (10, 'b');
> select * from xy where a = 1 and b in('x','y');
> ERROR:  operator 531 is not a member of opfamily 1976

You'll be able to see that the error no longer appears with the attached
updated set of patches, but I'm now seeing that the resulting plan with
patched for this particular query differs from what master (constraint
exclusion) produces.  Master produces a plan with no partitions (as one
would think is the correct plan), whereas patched produces a plan
including the xy1 partition.  I will think about that a bit and post
something later.

Thanks,
Amit

Вложения

Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS]path toward faster partition pruning

От

David Rowley

Дата:

12 января 2018 г., 09:30:38

On 12 January 2018 at 15:27, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/01/11 19:23, David Rowley wrote:

>> ERROR:  operator 531 is not a member of opfamily 1976
>
> You'll be able to see that the error no longer appears with the attached
> updated set of patches, but I'm now seeing that the resulting plan with
> patched for this particular query differs from what master (constraint
> exclusion) produces.  Master produces a plan with no partitions (as one
> would think is the correct plan), whereas patched produces a plan
> including the xy1 partition.  I will think about that a bit and post
> something later.

Thanks for looking at that.

I've got a few more things for you. I'm only partway through another
pass, but it makes sense to post what I have now if you're working on
a new version.

1. partitioing -> partitioning

 * Strategy of a partition clause operator per the partitioing operator class

2. get_partitions_from_clauses() modifies partclauses without
mentioning it in the header. I think you need to either:

a) warn about this in the header comment; or
b) do a list_copy() before list_concat()
c) do list_truncate back to the original length after you're done with the list.

3. get_partitions_from_clauses_recurse(), with:

  result = bms_add_range(result, 0, partdesc->nparts - 1);

You could change that to bms_add_range(NULL, ...) and ditch the
assignment of result to NULL at the start of the function.

4. classify_partition_bounding_keys() now returns bool, but the return
statement is still:

return keys->n_eqkeys + keys->n_minkeys + keys->n_maxkeys + n_keynullness;

my compiler didn't warn about that, but I'd imagine some might.

Instead, can you make it:

if (keys->n_eqkeys > 0 || keys->n_minkeys > 0 ||
    keys->n_maxkeys > 0 || n_keynullness > 0)
    return true;

return false;

probably equal keys are the most likely case, so it'll be good to
short circuit instead of performing addition on a bunch of stuff we
don't care about anymore.

5. In classify_partition_bounding_keys, why do we "continue" here?

clause = rinfo->clause;
if (rinfo->pseudoconstant &&
!DatumGetBool(((Const *) clause)->constvalue))
{
*constfalse = true;
continue;
}

Is there any point in searching further?

Also, if you were consistent with the return value for
classify_partition_bounding_keys when you've set *constfalse = true;
you wouldn't need to handle the case twice like you are in
get_partitions_from_clauses_recurse().

6. I think it would be nicer if get_partitions_from_ne_clauses returns
a set of partitions that could be excluded.

So instead of:

 * get_partitions_from_ne_clauses
 *
 * Return partitions of relation that satisfy all <> operator clauses in
 * ne_clauses.  Only ever called if relation is a list partitioned table.

Have:

 * get_partitions_from_ne_clauses
 *
 * Returns a Bitmapset of partitions that can be safely excluded due to
 * not-equal clauses existing for all possible partition values. It is only
 * valid to call this for LIST partitioned tables.

and instead of:

result = bms_add_range(NULL, 0, partdesc->nparts - 1);
result = bms_del_members(result, excluded_parts);
bms_free(excluded_parts);

return result;

Just do:

return excluded_parts;

and in get_partitions_from_clauses_recurse(), do bms_del_members
instead of bms_int_members.

there's less bit shuffling and it seems cleaner. Perhaps the function
name would need to be changed if we're inverting the meaning too.

(I've attached a patch which makes this change along with an idea in #8 below)

7. The following comment claims the function sets *datum, but there's
no param by that name:

/*
 * partkey_datum_from_expr
 * Extract constant value from expr and set *datum to that value
 */
static bool
partkey_datum_from_expr(PartitionKey key, int partkeyidx,
Expr *expr, Datum *value)

8. The code in get_partitions_from_ne_clauses() does perform quite a
few nested loops. I think a more simple way to would be to track the
offsets you've seen in a Bitmapset. This would save you having to
check for duplicates, as an offset can only contain a single datum.
You'd just need to build a couple of arrays after that, one to sum up
the offsets found per partition, and one for the total datums allowed
in the partition. If the numbers match then you can remove the
partition.

I've written this and attached it to this email. It saves about 50
lines of code and should perform much better for complex cases, for
example, a large NOT IN list. This also implements #6.

9. "the same" -> "it"

/*
* In case of NOT IN (..), we get a '<>', which while not
* listed as part of any operator family, we are able to
* handle the same if its negator is indeed a part of the
* partitioning operator family.
*/

10. in classify_partition_bounding_keys: "0" -> "false"

/* Return if no work to do below. */
if (!will_compute_keys || *constfalse)
return 0;

Likewise for:

if (*constfalse)
return 0;

11. I don't see partition_bound_bsearch used anywhere below the
following comment:

* Generate bounding tuple(s).
*
* We look up partitions in the partition bound descriptor using, say,
* partition_bound_bsearch(), which expects a Datum (or Datums if multi-
* column key).  So, extract the same out of the constant argument of
* each clause.

I also don't know what the comment is trying to say.

12.

 * operator and sets *incl if equality is implied

should be:

 * operator and set *incl to true if the operator's strategy is inclusive.

13. What does "the same" mean in:

* and add this one directly to the result.  Caller would
* arbitrarily choose one of the many and perform
* partition-pruning with the same.  It's possible that mutual

I think you quite often use "the same" to mean "it". Can you change that?

14. Not sure what parameter you're talking about here.

 * Evaluate 'leftarg op rightarg' and set *result to its value.
 *
 * leftarg and rightarg referred to above actually refer to the constant
 * operand (Datum) of the clause contained in the parameters leftarg and
 * rightarg below, respectively.  And op refers to the operator of the
 * clause contained in the parameter op below.

15. "the latter" is normally used when you're referring to the last
thing in a list which was just mentioned. In this case, leftarg_const
and rightarg_const is the list, so "the latter" should mean
rightarg_const, but I think you mean to compare them using the
operator.

* If the leftarg_const and rightarg_const are both of the type expected
* by op's operator, then compare them using the latter.

16. There are a few things to improve with the following comment:

/*
* Hash partitioning stores partition keys containing nulls in regular
* partitions.  That is, the code that determines the hash partition for
* a given row admits nulls in the partition key when computing the key's
* hash.  So, here we treat any IS NULL clauses on partition key columns as
* equality keys, along with any other non-null values coming from equality
* operator clauses.
*/

"admits" is not the correct word here, and "hash" should be "correct",
but there are more mistakes, so might be easier just to rewrite to:

/*
* Since tuples with NULL values in the partition key columns are
stored in regular partitions,
* we'll treat any IS NULL clauses here as regular equality clauses.
/*

17. The following example will cause get_partitions_for_keys_hash to misbehave:

create table hashp (a int, b int) partition by hash (a, b);
create table hashp1 partition of hashp for values with (modulus 4, remainder 0);
create table hashp2 partition of hashp for values with (modulus 4, remainder 1);
create table hashp3 partition of hashp for values with (modulus 4, remainder 3);
create table hashp4 partition of hashp for values with (modulus 4, remainder 2);
explain select * from hashp where a = 1 and a is null;

The following code assumes that you'll never get a NULL test for a key
that has an equality test, and ends up trying to prune partitions
thinking we got compatible clauses for both partition keys.

  memset(keyisnull, false, sizeof(keyisnull));
  for (i = 0; i < partkey->partnatts; i++)
  {
    if (bms_is_member(i, keys->keyisnull))
    {
      keys->n_eqkeys++;
      keyisnull[i] = true;
    }
  }

  /*
   * Can only do pruning if we know all the keys and they're all equality
   * keys including the nulls that we just counted above.
   */
  if (keys->n_eqkeys == partkey->partnatts)

The above code will need to be made smarter. It'll likely crash if you
change "b" to a pass-by-ref type.

18. The following code:

int other_idx = -1;

/*
* Only a designated partition accepts nulls, which if there
* exists one, return the same.
*/
if (partition_bound_accepts_nulls(boundinfo) ||
partition_bound_has_default(boundinfo))
other_idx = partition_bound_accepts_nulls(boundinfo)
? boundinfo->null_index
: boundinfo->default_index;
if (other_idx >= 0)
return bms_make_singleton(other_idx);
else
return NULL;

 should be simplified to:

 /*
  * NULLs may only exist in the NULL partition, or in the
  * default, if there's no NULL partition.
  */
 if (partition_bound_accepts_nulls(boundinfo))
return bms_make_singleton(boundinfo->null_index);
 else if (partition_bound_has_default(boundinfo))
return bms_make_singleton(boundinfo->default_index);
 return NULL;

19. "exists" -> "are"

* If there are no datums to compare keys with, but there exist
* partitions, it must be the default partition.

also, instead of writing "it must be the default partition." it should
be better to say "just return the default partition."

20. I don't think the return NULL should ever hit, is it worth putting
a comment to say /* shouldn't happen */

if (boundinfo->ndatums == 0)
{
if (partition_bound_has_default(boundinfo))
return bms_make_singleton(boundinfo->default_index);
else
return NULL;
}

21. Can the following comment does not explain the situation well:

/*
* boundinfo->ndatums - 1 is the last valid list partition datums
* index.
*/

There's really no possible non-default partition for this case, so
perhaps we should just return the default, if one exists. We do go on
to check the n_maxkeys needlessly for this case. At the very least the
comment should be changed to:

 /*
  * minkeys values are greater than any non-default partition.
  * We'll check that for case below.
  */

but I think it's worth just doing the default partition check there
and returning it, or NULL. It should help reduce confusion.

Can you also perform a self-review of the patch? Some of the things
I'm picking up are leftovers from a previous version of the patch. We
might never get through this review if you keep leaving those around!

I won't continue reviewing again until next week, so don't rush.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

get_partitions_excluded_by.patch

Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [SenderAddress Forgery]Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

12 января 2018 г., 10:12:21

David,

On 2018/01/12 12:30, David Rowley wrote:
> Can you also perform a self-review of the patch? Some of the things
> I'm picking up are leftovers from a previous version of the patch. We
> might never get through this review if you keep leaving those around!

Sorry, I will look more closely before posting the next version.  I guess
I may have rushed a bit too much when posting the v18/v19 patches, partly
because it's been 3 weeks since v17 and I felt I needed to catch up
quickly given the activity on the run-time pruning thread which depends on
the patches here.

Thanks,
Amit

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

12 января 2018 г., 15:09:32

On 10 January 2018 at 17:18, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Basically, the changes to add_paths_to_append_rel() are causing
> duplication in partition_rels.
>
> A test case is:
>
> create table part (a int, b int) partition by list(a);
> create table part1 partition of part for values in(1) partition by list (b);
> create table part2 partition of part1 for values in(1);
>
> select * from part;
>
> partition_rels ends up with 3 items in the list, but there's only 2
> partitions here. The reason for this is that, since planning here is
> recursively calling add_paths_to_append_rel, the list for part ends up
> with itself and part1 in it, then since part1's list already contains
> itself, per set_append_rel_size's "rel->live_partitioned_rels =
> list_make1_int(rti);", then part1 ends up in the list twice.
>
> It would be nicer if you could use a RelIds for this, but you'd also
> need some way to store the target partition relation since
> nodeModifyTable.c does:
>
> /* The root table RT index is at the head of the partitioned_rels list */
> if (node->partitioned_rels)
> {
>     Index root_rti;
>     Oid root_oid;
>
>     root_rti = linitial_int(node->partitioned_rels);
>     root_oid = getrelid(root_rti, estate->es_range_table);
>     rel = heap_open(root_oid, NoLock); /* locked by InitPlan */
> }
>
> You could also fix it by instead of doing:
>
> /*
> * Accumulate the live partitioned children of this child, if it's
> * itself partitioned rel.
> */
> if (childrel->part_scheme)
>     partitioned_rels = list_concat(partitioned_rels,
>        childrel->live_partitioned_rels);
>
> do something along the lines of:
>
> if (childrel->part_scheme)
> {
>     ListCell *lc;
>     ListCell *start = lnext(list_head(childrel->live_partitioned_rels));
>
>     for_each_cell(lc, start)
>        partitioned_rels = lappend_int(partitioned_rels,
>        lfirst_int(lc));
> }
>
> Although it seems pretty fragile. It would probably be better to find
> a nicer way of handling all this.

Hi Amit,

I also noticed earlier that this is still broken in v19.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [Sender Address Forgery]Re: [HACKERS] path toward fasterpartition pruning

От

Amit Langote

Дата:

12 января 2018 г., 15:51:20

Hi.

On 2018/01/12 18:09, David Rowley wrote:
> On 10 January 2018 at 17:18, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> Basically, the changes to add_paths_to_append_rel() are causing
>> duplication in partition_rels.

[ ... ]

> I also noticed earlier that this is still broken in v19.

I cannot see the duplication here (with v19 + some local changes per your
latest review, although I had fixed the issue in v18).

create table part (a int, b int) partition by list(a);
create table part1 partition of part for values in (1) partition by list (b);
create table part2 partition of part1 for values in (1);

select * from part;

For the above query, I set a breakpoint all the way in ExecInitAppend() to
see what partitioned_rels list it ends up with and I see no duplication:

   :partitioned_rels (i 1 3)

where 1 and 3 are RT indexes of part and part1, respectively.

With v17, you'd be able to see the duplication:

   :partitioned_rels (i 1 3 3)

Let me confirm again if you were complaining exactly of this duplication?
That the RT index of part1 appears twice due to the bug I claim I fixed i
v18?  Or something else?

Thanks,
Amit

Re: [Sender Address Forgery]Re: [HACKERS] path toward fasterpartition pruning

От

David Rowley

Дата:

12 января 2018 г., 15:58:32

On 12 January 2018 at 22:51, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/01/12 18:09, David Rowley wrote:
>> On 10 January 2018 at 17:18, David Rowley <david.rowley@2ndquadrant.com> wrote:
>>> Basically, the changes to add_paths_to_append_rel() are causing
>>> duplication in partition_rels.
>
>> I also noticed earlier that this is still broken in v19.
>
> I cannot see the duplication here (with v19 + some local changes per your
> latest review, although I had fixed the issue in v18).

I may have made a mistake there. The code I expected to change didn't.
I meant to test the case again, but I got distracted just before I did
and came back a while later and forgot that I hadn't tested.

If you've tested my case and it works, then please don't look any
further. I will look when v20 is ready.

Sorry for the false alarm... I must've been trying to do too many
things at once :-(

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS]path toward faster partition pruning

От

Robert Haas

Дата:

13 января 2018 г., 01:10:17

On Thu, Jan 11, 2018 at 10:30 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> Instead, can you make it:
>
> if (keys->n_eqkeys > 0 || keys->n_minkeys > 0 ||
>     keys->n_maxkeys > 0 || n_keynullness > 0)
>     return true;
>
> return false;

Or better yet:

return (keys->n_eqkeys > 0 || keys->n_minkeys > 0 ||
    keys->n_maxkeys > 0 || n_keynullness > 0);

It's not really necessary to write if (some condition is true) return
true; return false when you can just write return (boolean-valued
condition).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS]path toward faster partition pruning

От

Amit Langote

Дата:

16 января 2018 г., 14:08:50

Hi David.

Thanks for the review.

On 2018/01/12 12:30, David Rowley wrote:
> I've got a few more things for you. I'm only partway through another
> pass, but it makes sense to post what I have now if you're working on
> a new version.
> 
> 1. partitioing -> partitioning
> 
>  * Strategy of a partition clause operator per the partitioing operator class

Fixed.

> 2. get_partitions_from_clauses() modifies partclauses without
> mentioning it in the header. I think you need to either:
> 
> a) warn about this in the header comment; or
> b) do a list_copy() before list_concat()
> c) do list_truncate back to the original length after you're done with the list.

Went with (b).

> 3. get_partitions_from_clauses_recurse(), with:
> 
>   result = bms_add_range(result, 0, partdesc->nparts - 1);
> 
> You could change that to bms_add_range(NULL, ...) and ditch the
> assignment of result to NULL at the start of the function.

Done.

> 4. classify_partition_bounding_keys() now returns bool, but the return
> statement is still:
> 
> return keys->n_eqkeys + keys->n_minkeys + keys->n_maxkeys + n_keynullness;
> 
> my compiler didn't warn about that, but I'd imagine some might.

Oops, my bad.

> 
> Instead, can you make it:
> 
> if (keys->n_eqkeys > 0 || keys->n_minkeys > 0 ||
>     keys->n_maxkeys > 0 || n_keynullness > 0)
>     return true;
> 
> return false;
>
> probably equal keys are the most likely case, so it'll be good to
> short circuit instead of performing addition on a bunch of stuff we
> don't care about anymore.

Changed it to what Robert suggested downthread.

> 5. In classify_partition_bounding_keys, why do we "continue" here?
> 
> clause = rinfo->clause;
> if (rinfo->pseudoconstant &&
> !DatumGetBool(((Const *) clause)->constvalue))
> {
> *constfalse = true;
> continue;
> }
> 
> Is there any point in searching further?
> 
> Also, if you were consistent with the return value for
> classify_partition_bounding_keys when you've set *constfalse = true;
> you wouldn't need to handle the case twice like you are in
> get_partitions_from_clauses_recurse().

OK, I made classify_partition_bounding_keys() return true whenever set
*constfalse to true.

> 6. I think it would be nicer if get_partitions_from_ne_clauses returns
> a set of partitions that could be excluded.
> 
> So instead of:
> 
>  * get_partitions_from_ne_clauses
>  *
>  * Return partitions of relation that satisfy all <> operator clauses in
>  * ne_clauses.  Only ever called if relation is a list partitioned table.
> 
> Have:
> 
>  * get_partitions_from_ne_clauses
>  *
>  * Returns a Bitmapset of partitions that can be safely excluded due to
>  * not-equal clauses existing for all possible partition values. It is only
>  * valid to call this for LIST partitioned tables.
> 
> and instead of:
> 
> result = bms_add_range(NULL, 0, partdesc->nparts - 1);
> result = bms_del_members(result, excluded_parts);
> bms_free(excluded_parts);
> 
> return result;
> 
> Just do:
> 
> return excluded_parts;
> 
> and in get_partitions_from_clauses_recurse(), do bms_del_members
> instead of bms_int_members.
> 
> there's less bit shuffling and it seems cleaner. Perhaps the function
> name would need to be changed if we're inverting the meaning too.
> 
> (I've attached a patch which makes this change along with an idea in #8 below)

Thanks for the suggestions... (comment continues below)

> 7. The following comment claims the function sets *datum, but there's
> no param by that name:
> 
> /*
>  * partkey_datum_from_expr
>  * Extract constant value from expr and set *datum to that value
>  */
> static bool
> partkey_datum_from_expr(PartitionKey key, int partkeyidx,
> Expr *expr, Datum *value)

Fixed.

> 8. The code in get_partitions_from_ne_clauses() does perform quite a
> few nested loops. I think a more simple way to would be to track the
> offsets you've seen in a Bitmapset. This would save you having to
> check for duplicates, as an offset can only contain a single datum.
> You'd just need to build a couple of arrays after that, one to sum up
> the offsets found per partition, and one for the total datums allowed
> in the partition. If the numbers match then you can remove the
> partition.
> 
> I've written this and attached it to this email. It saves about 50
> lines of code and should perform much better for complex cases, for
> example, a large NOT IN list. This also implements #6.

I liked your patch, so incorporated it, except, I feel slightly
uncomfortable about the new name that you chose for the function because
it sounds a bit generic.  I mean the function solves a very specific
problem and have very strict requirements for calling it.  It's not like
we could pass it just any partitioned relation and/or just any set of
clauses.  It has to be a list-partitioned table and the list of clauses
must contain only the clauses containing compatible <> operators.  Checks
for those requirements are carried out in yet another place, that is,
classify_partition_bounding_keys().

Perhaps we can live with that though, because it's not a publicly
available function, but someone might get confused in the future.

> 9. "the same" -> "it"
> 
> /*
> * In case of NOT IN (..), we get a '<>', which while not
> * listed as part of any operator family, we are able to
> * handle the same if its negator is indeed a part of the
> * partitioning operator family.
> */

Done.

> 10. in classify_partition_bounding_keys: "0" -> "false"
> 
> /* Return if no work to do below. */
> if (!will_compute_keys || *constfalse)
> return 0;
> 
> Likewise for:
> 
> if (*constfalse)
> return 0;

Have fixed these per an earlier comment in this email.

> 11. I don't see partition_bound_bsearch used anywhere below the
> following comment:
> 
> * Generate bounding tuple(s).
> *
> * We look up partitions in the partition bound descriptor using, say,
> * partition_bound_bsearch(), which expects a Datum (or Datums if multi-
> * column key).  So, extract the same out of the constant argument of
> * each clause.
> 
> I also don't know what the comment is trying to say.

The comment is no longer very intelligible to me too.  I just wanted to
say here that, *elsewhere*, we will use a function like
partition_bound_bsearch() to look up partitions from the clauses we
matched against individual partition key columns.  That function expects
the lookup key to be in a Datum array form, not a list-of-clauses form.
So, we must construct the lookup key(s) by extracting constant values out
the clauses.

I tried to rewrite it that way.  Hope that's a bit clearer.

> 12.
> 
>  * operator and sets *incl if equality is implied
> 
> should be:
> 
>  * operator and set *incl to true if the operator's strategy is inclusive.

Done that way.

> 13. What does "the same" mean in:
> 
> * and add this one directly to the result.  Caller would
> * arbitrarily choose one of the many and perform
> * partition-pruning with the same.  It's possible that mutual

It means "the one that caller would arbitrarily choose of the many that
this function will return to it".  Anyway, I changed "the same" to "it".

> I think you quite often use "the same" to mean "it". Can you change that?

I guess that's just one of my many odd habits when writing English, all of
which I'm trying to get rid of, but apparently with limited success.  Must
try harder. :)

> 14. Not sure what parameter you're talking about here.
> 
>  * Evaluate 'leftarg op rightarg' and set *result to its value.
>  *
>  * leftarg and rightarg referred to above actually refer to the constant
>  * operand (Datum) of the clause contained in the parameters leftarg and
>  * rightarg below, respectively.  And op refers to the operator of the
>  * clause contained in the parameter op below.

I rewrote the above comment block as:

 * Try to compare the constant arguments of 'leftarg' and 'rightarg', in that
 * order, using the operator of 'op' and set *result to the result of this
 * comparison.

Is that any better?

> 15. "the latter" is normally used when you're referring to the last
> thing in a list which was just mentioned. In this case, leftarg_const
> and rightarg_const is the list, so "the latter" should mean
> rightarg_const, but I think you mean to compare them using the
> operator.
> 
> * If the leftarg_const and rightarg_const are both of the type expected
> * by op's operator, then compare them using the latter.

Rewrote it as:

     * We can compare leftarg_const and rightarg_const using op's operator
     * only if both are of the type expected by it.

> 16. There are a few things to improve with the following comment:
> 
> /*
> * Hash partitioning stores partition keys containing nulls in regular
> * partitions.  That is, the code that determines the hash partition for
> * a given row admits nulls in the partition key when computing the key's
> * hash.  So, here we treat any IS NULL clauses on partition key columns as
> * equality keys, along with any other non-null values coming from equality
> * operator clauses.
> */
> 
> "admits" is not the correct word here, and "hash" should be "correct",
> but there are more mistakes, so might be easier just to rewrite to:
> 
> /*
> * Since tuples with NULL values in the partition key columns are
> stored in regular partitions,
> * we'll treat any IS NULL clauses here as regular equality clauses.
> /*

Agreed that your version is better, so went with it.

> 17. The following example will cause get_partitions_for_keys_hash to misbehave:
> 
> create table hashp (a int, b int) partition by hash (a, b);
> create table hashp1 partition of hashp for values with (modulus 4, remainder 0);
> create table hashp2 partition of hashp for values with (modulus 4, remainder 1);
> create table hashp3 partition of hashp for values with (modulus 4, remainder 3);
> create table hashp4 partition of hashp for values with (modulus 4, remainder 2);
> explain select * from hashp where a = 1 and a is null;
> 
> The following code assumes that you'll never get a NULL test for a key
> that has an equality test, and ends up trying to prune partitions
> thinking we got compatible clauses for both partition keys.

Yeah, I think this example helps spot a problem.  I thought we'd never get
to get_partitions_for_keys_hash() for the above query, because we
should've been able to prove much earlier that the particular clause
combination should be always false (a cannot be both non-null 1 and null).
 Now, because the planner itself doesn't substitute a constant-false for
that, I taught classify_partition_bounding_keys() to do so.  It would now
return constfalse=true if it turns out that clauses in the input list lead
to contradictory nullness condition for a given partition column.

>   memset(keyisnull, false, sizeof(keyisnull));
>   for (i = 0; i < partkey->partnatts; i++)
>   {
>     if (bms_is_member(i, keys->keyisnull))
>     {
>       keys->n_eqkeys++;
>       keyisnull[i] = true;
>     }
>   }
> 
>   /*
>    * Can only do pruning if we know all the keys and they're all equality
>    * keys including the nulls that we just counted above.
>    */
>   if (keys->n_eqkeys == partkey->partnatts)
> 
> The above code will need to be made smarter. It'll likely crash if you
> change "b" to a pass-by-ref type.

Hmm, not sure why.  It seems to work:

create table hp (a int, b text) partition by hash (a, b);
create table hp1 partition of hp for values with (modulus 4, remainder 0);
create table hp2 partition of hp for values with (modulus 4, remainder 1);
create table hp3 partition of hp for values with (modulus 4, remainder 3);
create table hp4 partition of hp for values with (modulus 4, remainder 2);

insert into hp values (1, 'xxx');
INSERT 0 1

select tableoid::regclass, * from hp;
 tableoid | a |  b
----------+---+-----
 hp1   | 1 | xxx
(1 row)

insert into hp (a) values (1);
INSERT 0 1

insert into hp (b) values ('xxx');
INSERT 0 1

select tableoid::regclass, * from hp where a is null;
 tableoid | a |  b
----------+---+-----
 hp2   |   | xxx
(1 row)

select tableoid::regclass, * from hp where b is null;
 tableoid | a | b
----------+---+---
 hp1   | 1 |
(1 row)

select tableoid::regclass, * from hp where a = 1 and b is null;
 tableoid | a | b
----------+---+---
 hp1   | 1 |
(1 row)

select tableoid::regclass, * from hp where a is null and b = 'xxx';
 tableoid | a |  b
----------+---+-----
 hp2   |   | xxx
(1 row)

> 18. The following code:
> 
> int other_idx = -1;
> 
> /*
> * Only a designated partition accepts nulls, which if there
> * exists one, return the same.
> */
> if (partition_bound_accepts_nulls(boundinfo) ||
> partition_bound_has_default(boundinfo))
> other_idx = partition_bound_accepts_nulls(boundinfo)
> ? boundinfo->null_index
> : boundinfo->default_index;
> if (other_idx >= 0)
> return bms_make_singleton(other_idx);
> else
> return NULL;
> 
>  should be simplified to:
> 
>  /*
>   * NULLs may only exist in the NULL partition, or in the
>   * default, if there's no NULL partition.
>   */
>  if (partition_bound_accepts_nulls(boundinfo))
> return bms_make_singleton(boundinfo->null_index);
>  else if (partition_bound_has_default(boundinfo))
> return bms_make_singleton(boundinfo->default_index);
>  return NULL;

Agreed, done that way.

> 19. "exists" -> "are"
> 
> * If there are no datums to compare keys with, but there exist
> * partitions, it must be the default partition.
> 
> also, instead of writing "it must be the default partition." it should
> be better to say "just return the default partition."

OK, done.

> 20. I don't think the return NULL should ever hit, is it worth putting
> a comment to say /* shouldn't happen */
> 
> if (boundinfo->ndatums == 0)
> {
> if (partition_bound_has_default(boundinfo))
> return bms_make_singleton(boundinfo->default_index);
> else
> return NULL;
> }

I added a /* shouldn't happen */ comment next to return NULL.

> 21. Can the following comment does not explain the situation well:
> 
> /*
> * boundinfo->ndatums - 1 is the last valid list partition datums
> * index.
> */
> 
> There's really no possible non-default partition for this case, so
> perhaps we should just return the default, if one exists. We do go on
> to check the n_maxkeys needlessly for this case. At the very least the
> comment should be changed to:
> 
>  /*
>   * minkeys values are greater than any non-default partition.
>   * We'll check that for case below.
>   */
> 
> but I think it's worth just doing the default partition check there
> and returning it, or NULL. It should help reduce confusion.

Yep, done.


Attached v20.  Thanks again.

Regards,
Amit

On 17 January 2018 at 17:05, David Rowley <david.rowley@2ndquadrant.com> wrote:
> 6. Which brings me to; why do we need match_clauses_to_partkey at all?
> classify_partition_bounding_keys seems to do all the work
> match_clauses_to_partkey does, plus more. Item #3 above is caused by
> an inconsistency between these functions. What benefit does
> match_clauses_to_partkey give? I might understand if you were creating
> list of clauses matching each partition key, but you're just dumping
> everything in one big list which causes
> classify_partition_bounding_keys() to have to match each clause to a
> partition key again, and classify_partition_bounding_keys is even
> coded to ignore clauses that don't' match any key, so it makes me
> wonder what is match_clauses_to_partkey actually for?

I started to look at this and ended up shuffling the patch around a
bit to completely remove the match_clauses_to_partkey function.

I also cleaned up some of the comments and shuffled some fields around
in some of the structs to shrink them down a bit.

All up, this has saved 268 lines of code in the patch.

src/backend/catalog/partition.c       | 296 ++++++++++++++++-----------
src/backend/optimizer/path/allpaths.c | 368 ++--------------------------------
2 files changed, 198 insertions(+), 466 deletions(-)

It's had very minimal testing. Really I've only tested that the
regression tests pass.

I also fixed up the bad assumption that IN lists will contain Consts
only which hopefully fixes the crash I reported earlier.

I saw you'd added a check to look for contradicting IS NOT NULL
clauses when processing an IS NULL clause, but didn't do anything for
the opposite case. I added code for this so it behaves the same
regardless of the clause order.

Can you look at my changes and see if I've completely broken anything?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

faster_partition_prune_v20_delta_drowley.patch

Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS]path toward faster partition pruning

От

Amit Langote

Дата:

17 января 2018 г., 16:48:21

Hi David.

On Wed, Jan 17, 2018 at 6:19 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 17 January 2018 at 17:05, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> 6. Which brings me to; why do we need match_clauses_to_partkey at all?
>> classify_partition_bounding_keys seems to do all the work
>> match_clauses_to_partkey does, plus more. Item #3 above is caused by
>> an inconsistency between these functions. What benefit does
>> match_clauses_to_partkey give? I might understand if you were creating
>> list of clauses matching each partition key, but you're just dumping
>> everything in one big list which causes
>> classify_partition_bounding_keys() to have to match each clause to a
>> partition key again, and classify_partition_bounding_keys is even
>> coded to ignore clauses that don't' match any key, so it makes me
>> wonder what is match_clauses_to_partkey actually for?
>
> I started to look at this and ended up shuffling the patch around a
> bit to completely remove the match_clauses_to_partkey function.
>
> I also cleaned up some of the comments and shuffled some fields around
> in some of the structs to shrink them down a bit.
>
> All up, this has saved 268 lines of code in the patch.
>
> src/backend/catalog/partition.c       | 296 ++++++++++++++++-----------
> src/backend/optimizer/path/allpaths.c | 368 ++--------------------------------
> 2 files changed, 198 insertions(+), 466 deletions(-)
>
> It's had very minimal testing. Really I've only tested that the
> regression tests pass.
>
> I also fixed up the bad assumption that IN lists will contain Consts
> only which hopefully fixes the crash I reported earlier.
>
> I saw you'd added a check to look for contradicting IS NOT NULL
> clauses when processing an IS NULL clause, but didn't do anything for
> the opposite case. I added code for this so it behaves the same
> regardless of the clause order.
>
> Can you look at my changes and see if I've completely broken anything?

Thanks for the patch.  I applied the patch and see that it didn't
break any tests, although haven't closely reviewed the code yet.

I'm concerned that after your patch to remove
match_clauses_to_partkey(), we'd be doing more work than necessary in
some cases.  For example, consider the case of using run-time pruning
for nested loop where the inner relation is a partitioned table.  With
the old approach, get_partitions_from_clauses() would only be handed
the clauses that are known to match the partition keys (which most
likely is fewer than all of the query's clauses), so
get_partitions_from_clauses() doesn't have to do the work of filtering
non-partition clauses every time (that is, for every outer row).
That's why I had decided to keep that part in the planner.

Thanks,
Amit

Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS]path toward faster partition pruning

От

David Rowley

Дата:

17 января 2018 г., 17:13:14

On 17 January 2018 at 23:48, Amit Langote <amitlangote09@gmail.com> wrote:
> I'm concerned that after your patch to remove
> match_clauses_to_partkey(), we'd be doing more work than necessary in
> some cases.  For example, consider the case of using run-time pruning
> for nested loop where the inner relation is a partitioned table.  With
> the old approach, get_partitions_from_clauses() would only be handed
> the clauses that are known to match the partition keys (which most
> likely is fewer than all of the query's clauses), so
> get_partitions_from_clauses() doesn't have to do the work of filtering
> non-partition clauses every time (that is, for every outer row).
> That's why I had decided to keep that part in the planner.

That might be better served by splitting
classify_partition_bounding_keys() into separate functions, the first
function would be in charge of building keyclauses_all. That way the
remaining work during the executor would never need to match clauses
to a partition key as they'd be in lists dedicated to each key.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS]path toward faster partition pruning

От

David Rowley

Дата:

18 января 2018 г., 09:14:51

On 18 January 2018 at 00:13, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 17 January 2018 at 23:48, Amit Langote <amitlangote09@gmail.com> wrote:
>> I'm concerned that after your patch to remove
>> match_clauses_to_partkey(), we'd be doing more work than necessary in
>> some cases.  For example, consider the case of using run-time pruning
>> for nested loop where the inner relation is a partitioned table.  With
>> the old approach, get_partitions_from_clauses() would only be handed
>> the clauses that are known to match the partition keys (which most
>> likely is fewer than all of the query's clauses), so
>> get_partitions_from_clauses() doesn't have to do the work of filtering
>> non-partition clauses every time (that is, for every outer row).
>> That's why I had decided to keep that part in the planner.
>
> That might be better served by splitting
> classify_partition_bounding_keys() into separate functions, the first
> function would be in charge of building keyclauses_all. That way the
> remaining work during the executor would never need to match clauses
> to a partition key as they'd be in lists dedicated to each key.

I've attached another delta against your v20 patch which does this.
It's very rough for now and I've only checked that it passes the
regression test so far.

It will need some cleanup work, but I'd be keen to know what you think
of the general idea. I've not fully worked out how run-time pruning
will use this as it'll need another version of
get_partitions_from_clauses but passes in a PartScanClauseInfo
instead, and does not call extract_partition_key_clauses. That area
probably  needs some shuffling around so that does not end up a big
copy and paste of all that new logic.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

faster_partition_prune_v20_delta_drowley_v2.patch

Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [SenderAddress Forgery]Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

18 января 2018 г., 16:56:47

Hi David.

On 2018/01/18 12:14, David Rowley wrote:
> On 18 January 2018 at 00:13, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> On 17 January 2018 at 23:48, Amit Langote <amitlangote09@gmail.com> wrote:
>>> I'm concerned that after your patch to remove
>>> match_clauses_to_partkey(), we'd be doing more work than necessary in
>>> some cases.  For example, consider the case of using run-time pruning
>>> for nested loop where the inner relation is a partitioned table.  With
>>> the old approach, get_partitions_from_clauses() would only be handed
>>> the clauses that are known to match the partition keys (which most
>>> likely is fewer than all of the query's clauses), so
>>> get_partitions_from_clauses() doesn't have to do the work of filtering
>>> non-partition clauses every time (that is, for every outer row).
>>> That's why I had decided to keep that part in the planner.
>>
>> That might be better served by splitting
>> classify_partition_bounding_keys() into separate functions, the first
>> function would be in charge of building keyclauses_all. That way the
>> remaining work during the executor would never need to match clauses
>> to a partition key as they'd be in lists dedicated to each key.
> 
> I've attached another delta against your v20 patch which does this.
> It's very rough for now and I've only checked that it passes the
> regression test so far.

Thanks!

> It will need some cleanup work, but I'd be keen to know what you think
> of the general idea.

This one looks in a much better shape.

> I've not fully worked out how run-time pruning
> will use this as it'll need another version of
> get_partitions_from_clauses but passes in a PartScanClauseInfo
> instead, and does not call extract_partition_key_clauses. That area
> probably  needs some shuffling around so that does not end up a big
> copy and paste of all that new logic.
So, I've been assuming that the planner changes in the run-time pruning
patch have to do with selecting clauses (restriction clauses not
containing Consts and/or join clauses) to be passed to the executor by
recording them in the Append node.  Will they be selected by the planner
calling into partition.c?

Meanwhile, here is a revised version (v21) that incorporates your changes.
 I added you as the author in 0002 and 0005 patches.  I guess a v22 will
have to follow very soon...

Thanks,
Amit

Hi Amit
,
On 19 January 2018 at 04:03, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 18 January 2018 at 23:56, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> So, I've been assuming that the planner changes in the run-time pruning
>> patch have to do with selecting clauses (restriction clauses not
>> containing Consts and/or join clauses) to be passed to the executor by
>> recording them in the Append node.  Will they be selected by the planner
>> calling into partition.c?
>
> I had thought so. I only have a rough idea in my head and that's that
> the PartitionPruneInfo struct that I wrote for the run-time pruning
> patch would have the clause List replaced with this new
> PartScanClauseInfo struct (which likely means it needs to go into
> primnodes.h), this struct would contain all the partition pruning
> clauses in a more structured form so that no matching of quals to the
> partition key wouldn't be required during execution. The idea is that
> we'd just need to call; remove_redundant_clauses,
> extract_bounding_datums and get_partitions_for_keys. I think this is
> the bare minimum of work that can be done in execution since we can't
> remove the redundant clauses until we know the values of the Params.
>
> Likely this means there will need to be 2 functions externally
> accessible for this in partition.c. I'm not sure exactly how best to
> do this. Maybe it's fine just to have allpaths.c call
> extract_partition_key_clauses to generate the PartScanClauseInfo then
> call some version of get_partitions_from_clauses which does do the
> extract_partition_key_clauses duties. This is made more complex by the
> code that handles adding the default partition bound to the quals. I'm
> not yet sure where that should live.
>
> I've also been thinking of having some sort of PartitionPruneContext
> struct that we can pass around the functions. Runtime pruning had to
> add structs which store the Param values to various functions which I
> didn't like. It would be good to just add those to the context and
> have them passed down without having to bloat the parameters in the
> functions. I might try and do that tomorrow too. This should make the
> footprint of the runtime pruning patch is a bit smaller.

Attached is what I had in mind about how to do this. Only the planner
will need to call populate_partition_clause_info. The planner and
executor can call get_partitions_from_clauseinfo. I'll just need to
change the run-time prune patch to pass the PartitionClauseInfo into
the executor in the Append node.

I've also added the context struct that I mentioned above. It's
currently not carrying much weight, but the idea is that this context
will be passed around a bit more with the run-time pruning patch and
it will also carry the details about parameter values. I'll need to
modify a few signatures of functions like partkey_datum_from_expr to
pass the context there too. I didn't do that here because currently,
those functions have no use for the context with the fields that they
currently have.

I've also fixed a bug where when you built the commutator OpExpr in
what's now called extract_partition_key_clauses the inputcollid was
not being properly set. The changes I made there go much further than
just that, I've completely removed the OpExpr from the PartClause
struct as only two members were ever used. I thought it was better
just to add those to PartClause instead.

I also did some renaming of variables that always assumed that the
Expr being compared to the partition key was a constant. This is true
now, but with run-time pruning patch, it won't be, so I thought it was
better to do this here rather than having to rename them in the
run-time pruning patch.

One thing I don't yet understand about the patch is the use of
predicate_refuted_by() in get_partitions_from_or_clause_args(). I did
adjust the comment above that code, but I'm still not certain I fully
understand why that function has to be used instead of building the
clauses for the OR being processed along with the remaining clauses.
Is it that this was too hard to extract that you ended up using
predicate_refuted_by()?

I've also removed the makeBoolExpr call that you were encapsulating
the or_clauses in. I didn't really see the need for this since you
just removed it again when looping over the or_clauses.

The only other changes are just streamlining code and comment changes.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

faster_partition_prune_v21_delta_drowley_v1.patch

Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [SenderAddress Forgery]Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

23 января 2018 г., 16:22:37

Hi David.

On 2018/01/23 15:44, David Rowley wrote:
> On 19 January 2018 at 04:03, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> On 18 January 2018 at 23:56, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> So, I've been assuming that the planner changes in the run-time pruning
>>> patch have to do with selecting clauses (restriction clauses not
>>> containing Consts and/or join clauses) to be passed to the executor by
>>> recording them in the Append node.  Will they be selected by the planner
>>> calling into partition.c?
>>
>> I had thought so. I only have a rough idea in my head and that's that
>> the PartitionPruneInfo struct that I wrote for the run-time pruning
>> patch would have the clause List replaced with this new
>> PartScanClauseInfo struct (which likely means it needs to go into
>> primnodes.h), this struct would contain all the partition pruning
>> clauses in a more structured form so that no matching of quals to the
>> partition key wouldn't be required during execution. The idea is that
>> we'd just need to call; remove_redundant_clauses,
>> extract_bounding_datums and get_partitions_for_keys. I think this is
>> the bare minimum of work that can be done in execution since we can't
>> remove the redundant clauses until we know the values of the Params.
>>
>> Likely this means there will need to be 2 functions externally
>> accessible for this in partition.c. I'm not sure exactly how best to
>> do this. Maybe it's fine just to have allpaths.c call
>> extract_partition_key_clauses to generate the PartScanClauseInfo then
>> call some version of get_partitions_from_clauses which does do the
>> extract_partition_key_clauses duties. This is made more complex by the
>> code that handles adding the default partition bound to the quals. I'm
>> not yet sure where that should live.
>>
>> I've also been thinking of having some sort of PartitionPruneContext
>> struct that we can pass around the functions. Runtime pruning had to
>> add structs which store the Param values to various functions which I
>> didn't like. It would be good to just add those to the context and
>> have them passed down without having to bloat the parameters in the
>> functions. I might try and do that tomorrow too. This should make the
>> footprint of the runtime pruning patch is a bit smaller.
> 
> Attached is what I had in mind about how to do this.

Thanks for the delta patch.  I will start looking at it tomorrow.

Regards,
Amit

Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [SenderAddress Forgery]Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

24 января 2018 г., 06:04:35

On 23 January 2018 at 23:22, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/01/23 15:44, David Rowley wrote:
>> Attached is what I had in mind about how to do this.
>
> Thanks for the delta patch.  I will start looking at it tomorrow.

Thanks. I've been looking more at this and I've made a few more
adjustments in the attached.

This delta patch should be applied against the
faster_partition_prune_v21_delta_drowley_v1.patch one I sent
yesterday. This changes a few comments, also now correctly passes the
context to get_partitions_excluded_by_ne_clauses and fixes a small
error where the patch was failing to record a notnull for the
partition key when it saw a strict <> clause. It was only doing this
for the opposite case, but both seem to be perfectly applicable. I
also made a small adjustment to the regression tests to ensure this is
covered.

I'm now going to start work on basing the partition pruning patch on
top of this.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

faster_partition_prune_v21_delta_drowley_v1_delta.patch

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

26 января 2018 г., 15:17:41

Hi David.

I looked at both of your faster_partition_prune_v21_delta_drowley_v1.patch
and faster_partition_prune_v21_delta_drowley_v1_delta.patch and have
incorporated them into the main patch series to get us the attached v22
set.  I like the result very much.  Thank you!

On 2018/01/23 15:44, David Rowley wrote:
> Attached is what I had in mind about how to do this. Only the planner
> will need to call populate_partition_clause_info. The planner and
> executor can call get_partitions_from_clauseinfo. I'll just need to
> change the run-time prune patch to pass the PartitionClauseInfo into
> the executor in the Append node.

I liked the division into those two functions, although not quite the
embedding of "info" into the function names.  I think it's good to just
call them populate_partition_clauses and get_partitions_from_clauses.  It
seems OK though for their arguments to contain "info" in their names.

> I've also added the context struct that I mentioned above. It's
> currently not carrying much weight, but the idea is that this context
> will be passed around a bit more with the run-time pruning patch and
> it will also carry the details about parameter values. I'll need to
> modify a few signatures of functions like partkey_datum_from_expr to
> pass the context there too. I didn't do that here because currently,
> those functions have no use for the context with the fields that they
> currently have.

The context struct seems good to me too.

> I've also fixed a bug where when you built the commutator OpExpr in
> what's now called extract_partition_key_clauses the inputcollid was
> not being properly set. The changes I made there go much further than
> just that, I've completely removed the OpExpr from the PartClause
> struct as only two members were ever used. I thought it was better
> just to add those to PartClause instead.

I wondered if we should rename that to something like
PartClauseProperties, but maybe that's too long.

> I also did some renaming of variables that always assumed that the
> Expr being compared to the partition key was a constant. This is true
> now, but with run-time pruning patch, it won't be, so I thought it was
> better to do this here rather than having to rename them in the
> run-time pruning patch.

OK, seems fine.

> One thing I don't yet understand about the patch is the use of
> predicate_refuted_by() in get_partitions_from_or_clause_args(). I did
> adjust the comment above that code, but I'm still not certain I fully
> understand why that function has to be used instead of building the
> clauses for the OR being processed along with the remaining clauses.
> Is it that this was too hard to extract that you ended up using
> predicate_refuted_by()?

I have tried to explain that better in the updated comment in the new
patch, along with some code rearrangement to better make sense of what's
going on.  Let me just copy-paste the new comment I wrote.  I have tried
to rethink the solution a number of times but never came up with a
sensible alternative.

/*
 * When matching an OR expression, it is only checked if at least one of
 * its args matches the partition key, not all.  For arguments that don't
 * match, we cannot eliminate any of its partitions using
 * get_partitions_from_clauses().  However, if the table is itself a
 * partition, we may be able to prove using constraint exclusion that the
 * clause refutes its partition constraint, that is, we can eliminate all
 * of its partitions.
 */
foreach(lc, or_clause_args)
{

> I've also removed the makeBoolExpr call that you were encapsulating
> the or_clauses in. I didn't really see the need for this since you
> just removed it again when looping over the or_clauses.

Ah, OK.  I first became concerned that you said you were adding arguments
of different OR expressions into a single list and call it or_clauses, but
calmed down after checking that that's not the case. :)

> The only other changes are just streamlining code and comment changes.

I made a few of those myself in the updated patches.

Thanks a lot again for your work on this.

Regards,
Amit

Horiguchi-san,

Thanks for the review.

On 2018/01/30 20:43, Kyotaro HORIGUCHI wrote:
> At Tue, 30 Jan 2018 09:57:44 +0900, Amit Langote wrote:
>> AFAICS, v22 cleanly applies to HEAD (c12693d8f3 [1]), compiles, and make
> I have some random comments on 0002.
> 
> -extract_partition_key_clauses implicitly assumes that the
>  commutator of inequality operator is the same to the original
>  operator. (commutation for such operators is an identity
>  function.)

Yeah, it seems so.

>  I believe it is always true for a sane operator definition, but
>  perhaps wouldn't it be better using commutator instead of
>  opclause->opno as the source of negator if any? (Attached diff 1)

ISTM, the same thing happens with or without the patch.

- pc->opno = OidIsValid(commutator) ? commutator : opclause->opno;
+ pc->opno = comparator;

comparator as added by the patch is effectively equal to the RHS
expression in the deleted line.

> -get_partitions_from_or_clause_args abandons arg_partset with all
>  bit set when partition constraint doesn't refute whole the
>  partition. Finally get_partitions_from_clauses does the same
>  thing but it's waste of cycles and looks weird. It seems to have
>  intended to return immediately there.
> 
>   >       /* Couldn't eliminate any of the partitions. */
>   >       partdesc = RelationGetPartitionDesc(context->relation);
>   > -     arg_partset = bms_add_range(NULL, 0, partdesc->nparts - 1);
>   > +     return bms_add_range(NULL, 0, partdesc->nparts - 1);
>   >     }
>   >  
>   >     subcontext.rt_index = context->rt_index;
>   >     subcontext.relation = context->relation;
>   >     subcontext.clauseinfo = &partclauseinfo;
>  !>     arg_partset = get_partitions_from_clauses(&subcontext);

You're right, fixed.

> -get_partitions_from_or_clause_args converts IN (null) into
>  nulltest and the nulltest doesn't exclude a child that the
>  partition key column can be null.
> 
>  drop table if exists p;
>  create table p (a int, b int) partition by list (a);
>  create table c1 partition of p for values in (1, 5, 7);
>  create table c2 partition of p for values in (4, 6, null);
>  insert into p values (1, 0), (null, 0);
>  
>  explain select tableoid::regclass, * from p where a in (1, null);
>  >                            QUERY PLAN                            
>  > -----------------------------------------------------------------
>  >  Result  (cost=0.00..76.72 rows=22 width=12)
>  >    ->  Append  (cost=0.00..76.50 rows=22 width=12)
>  >          ->  Seq Scan on c1  (cost=0.00..38.25 rows=11 width=12)
>  >                Filter: (a = ANY ('{1,NULL}'::integer[]))
>  >          ->  Seq Scan on c2  (cost=0.00..38.25 rows=11 width=12)
>  >                Filter: (a = ANY ('{1,NULL}'::integer[]))
> 
>  Although the clause "a in (null)" doesn't match the (null, 0)
>  row so it donesn't harm finally, I don't think this is a right
>  behavior. null in an SAOP rightop should be just ignored on
>  partition pruning. Or is there any purpose of this behavior?

Yeah, it seems that we're better off ignoring null values appearing the
IN-list.  Framing a IS NULL or IS NOT NULL expression corresponding to a
null value in the SAOP rightop array doesn't seem to be semantically
correct, as you also pointed out.  In ExecEvalScalarArrayOpExpr(), I see
that a null in the rightop array (IN-list) doesn't lead to selecting rows
containing null in the corresponding column.

> - In extract_bounding_datums, clauseinfo is set twice to the same
>   value.

Oops, my bad when merging in David's patch.

Update patch set attached.  Thanks again.

Regards,
Amit

Вложения

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

01 февраля 2018 г., 05:57:41

On 31 January 2018 at 21:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Update patch set attached.  Thanks again.

(My apologies for being slow to respond here. I've been on leave this
week and I'm off again next week. I now have a little time to reply)

Hi Amit,

Thanks for incorporating my changes into the patchset. A while ago I
was rebasing the run-time pruning patch on top of this but ran into a
few problems which are all results of my changes.

1. remove_redundant_clauses collapses the PartClause list into the
most restrictive set of clauses. This disallows multiple evaluations
of the PartitionClauseInfo during runtime pruning. I've written a
proposed fix for this and attached it.

2. PartitionClauseInfo->keyclauses is a list of PartClause which is
not a node type. This will cause _copyPartitionClauseInfo() to fail.

I'm still not quite sure the best way to fix #2 since PartClause
contains a FmgrInfo. I do have a local fix which moves PartClause to
primnodes.h and makes it a proper node type. I also added a copy
function which does not copy any of the cache fields in PartClause. It
just sets valid_cache to false. I didn't particularly think this was
the correct fix. I just couldn't think of how exactly this should be
done at the time.

The attached patch also adds the missing nodetag from
PartitionClauseInfo and also fixes up other code so as we don't memset
the node memory to zero, as that would destroy the tag. I ended up
just having extract_partition_key_clauses do the makeNode call. This
also resulted in populate_partition_clauses being renamed to
generate_partition_clauses

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

PartitionClauseInfo_reevaluation.patch

Re: [HACKERS] path toward faster partition pruning

От

Robert Haas

Дата:

02 февраля 2018 г., 04:38:07

+/*
+ * PartitionBoundCmpArg - Caller-defined argument to be passed to
+ *                                               partition_bound_cmp()
+ *
+ * The first (fixed) argument involved in a comparison is the partition bound
+ * found in the catalog, while an instance of the following struct describes
+ * either a new partition bound being compared against existing bounds
+ * (caller should set is_bound to true and set bound), or a new tuple's
+ * partition key specified in datums (caller should set ndatums to the number
+ * of valid datums that are passed in the array).
+ */
+typedef struct PartitionBoundCmpArg
+{
+       bool    is_bound;
+       union
+       {
+               PartitionListValue         *lbound;
+               PartitionRangeBound        *rbound;
+               PartitionHashBound         *hbound;
+       }       bound;
+
+       Datum  *datums;
+       int             ndatums;
+} PartitionBoundCmpArg;

This is a pretty strange definition.  datums/ndatums are never valid
at the same time as any of lbound/rbound/hbound, but are not included
in the union.  Also, is_bound doesn't tell you which of
rbound/lbound/hbound is actually valid.  Granted, the current calling
convention looks like a mess, too.  Apparently, the argument to
partition_bound_cmp is a PartitionBoundSpec if using hash
partitioning, a Datum if list partitioning, and either a
PartitionRangeBound or a Datum * if range partitioning depending on
the value of probe_is_bound, and I use the word "apparently" because
there are zero words of comments explaining what the argument to
partition_bound_cmp of type "void *" is supposed to mean.  I really
should have noticed that and insisted that it be fixed before
partitioning got committed.

Looking a bit further, there are exactly two calls to
partition_bound_cmp().  One is in partition_bound_bsearch() and the
other is in check_new_partition_bound(). Now, looking at this, both
the call to partition_bound_cmp() and every single call to
partition_bound_bsearch() are inside a switch branch where we've
dispatched on the partitioning type, which means that from code that
is already specialized by partitioning type we are calling generic
code which then turns back around and calls code that is specialized
by partitioning type.  Now, that could make sense if the generic code
is pretty complex, but here's it's basically just the logic to do a
bsearch.  It seems to me that a cleaner solution here would be to
duplicate that logic.  Then we could have...

static int partition_list_bsearch(PartitionKey key, PartitionBoundInfo
boundinfo,
                        Datum value, bool *is_equal);
static int partition_range_bsearch(PartitionKey key,
PartitionBoundInfo boundinfo,
                        PartitionRangeBound *probe);
static int partition_range_datum_bsearch(PartitionKey key,
PartitionBoundInfo boundinfo,
                        int nvalues, Datum *values);
static int partition_hash_bsearch(PartitionKey key, PartitionBoundInfo
boundinfo,
                        int modulus, int remainder, bool *is_equal);

...which would involve fewer branches at runtime and more type-safety
at compile time.  partition_hash_bsearch could directly call
partition_hbound_cmp, partition_list_bsearch could directly invoke
FunctionCall2Coll, partition_range_bsearch could directly call
partition_rbound_cmp, and partition_range_datum_bsearch could directly
call partition_rbound_datum_cmp.

All-in-all that seems a lot nicer to me than what we have here now.
IIUC, the purpose of this patch is to let you search on a prefix of
the partition keys, but I think that's really only possible for range
partitioning, and it seems like the proposed nvalues argument to
partition_range_datum_bsearch would give you what you need.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

02 февраля 2018 г., 12:04:17

Thanks for the review.

On 2018/02/02 7:38, Robert Haas wrote:
> +/*
> + * PartitionBoundCmpArg - Caller-defined argument to be passed to
> + *                                               partition_bound_cmp()
> + *
> + * The first (fixed) argument involved in a comparison is the partition bound
> + * found in the catalog, while an instance of the following struct describes
> + * either a new partition bound being compared against existing bounds
> + * (caller should set is_bound to true and set bound), or a new tuple's
> + * partition key specified in datums (caller should set ndatums to the number
> + * of valid datums that are passed in the array).
> + */
> +typedef struct PartitionBoundCmpArg
> +{
> +       bool    is_bound;
> +       union
> +       {
> +               PartitionListValue         *lbound;
> +               PartitionRangeBound        *rbound;
> +               PartitionHashBound         *hbound;
> +       }       bound;
> +
> +       Datum  *datums;
> +       int             ndatums;
> +} PartitionBoundCmpArg;
> 
> This is a pretty strange definition.  datums/ndatums are never valid
> at the same time as any of lbound/rbound/hbound, but are not included
> in the union.  Also, is_bound doesn't tell you which of
> rbound/lbound/hbound is actually valid.  Granted, the current calling
> convention looks like a mess, too.  Apparently, the argument to
> partition_bound_cmp is a PartitionBoundSpec if using hash
> partitioning, a Datum if list partitioning, and either a
> PartitionRangeBound or a Datum * if range partitioning depending on
> the value of probe_is_bound, and I use the word "apparently" because
> there are zero words of comments explaining what the argument to
> partition_bound_cmp of type "void *" is supposed to mean.  I really
> should have noticed that and insisted that it be fixed before
> partitioning got committed.

Yeah, I was trying to fix the status quo by introducing that new struct,
but I agree it's much better to modify the functions around a bit like the
way you describe below.

> Looking a bit further, there are exactly two calls to
> partition_bound_cmp().  One is in partition_bound_bsearch() and the
> other is in check_new_partition_bound(). Now, looking at this, both
> the call to partition_bound_cmp() and every single call to
> partition_bound_bsearch() are inside a switch branch where we've
> dispatched on the partitioning type, which means that from code that
> is already specialized by partitioning type we are calling generic
> code which then turns back around and calls code that is specialized
> by partitioning type.  Now, that could make sense if the generic code
> is pretty complex, but here's it's basically just the logic to do a
> bsearch.  It seems to me that a cleaner solution here would be to
> duplicate that logic.  Then we could have...
> 
> static int partition_list_bsearch(PartitionKey key, PartitionBoundInfo
> boundinfo,
>                         Datum value, bool *is_equal);
> static int partition_range_bsearch(PartitionKey key,
> PartitionBoundInfo boundinfo,
>                         PartitionRangeBound *probe);
> static int partition_range_datum_bsearch(PartitionKey key,
> PartitionBoundInfo boundinfo,
>                         int nvalues, Datum *values);
> static int partition_hash_bsearch(PartitionKey key, PartitionBoundInfo
> boundinfo,
>                         int modulus, int remainder, bool *is_equal);
> 
> ...which would involve fewer branches at runtime and more type-safety
> at compile time.  partition_hash_bsearch could directly call
> partition_hbound_cmp, partition_list_bsearch could directly invoke
> FunctionCall2Coll, partition_range_bsearch could directly call
> partition_rbound_cmp, and partition_range_datum_bsearch could directly
> call partition_rbound_datum_cmp.
> 
> All-in-all that seems a lot nicer to me than what we have here now.
> IIUC, the purpose of this patch is to let you search on a prefix of
> the partition keys, but I think that's really only possible for range
> partitioning, and it seems like the proposed nvalues argument to
> partition_range_datum_bsearch would give you what you need.

Your proposed cleanup sounds much better, so I implemented it in the
attached updated 0001, while dropping the previously proposed
PartitionBoundCmpArg approach.

Updated set of patches attached (patches 0002 onward mostly unchanged,
except I incorporated the delta patch posted by David upthread).

Thanks,
Amit

On 2018/02/06 18:55, Amit Langote wrote:
>> How fast is this patch these days, compared with the current approach?
>>  It would be good to test both when nearly all of the partitions are
>> pruned and when almost none of the partitions are pruned.
> 
> I will include some performance numbers in my next email, which hopefully
> should not be later than Friday this week.

Here is the latest set of patches.  I can see about 2x speedup in planning
time for various partition counts, although it grows linearly as the
partition count grows (same as with HEAD).  Detailed performance figures
follow.

* Partitioned table schema:

H:
create table ht (a int, b int) partition by hash (b);
create table ht_* partition of ht for values with (modulus N, ...)

L:
create table lt (a int, b int) partition by list (b);
create table lt_1 partition of lt for values in (1)
..
create table lt_N partition of lt for values in (N)

R:
create table rt (a int, b int) partition by range (b);
create table rt_1 partition of rt for values from (1) to (<step>)
..
create table rt_N partition of rt for values in (N-1 * <step>) to (N * <step>)

* Queries

Prunes every partition but 1: select * from table_name where b = 1
Prunes none: select * from table_name where b >= 1

* Planning time in milliseconds (average of 5 runs).

On HEAD:

parts H-prune   H-noprune    L-prune   L-noprune   R-prune    R-noprune
   8     1.50        1.42       1.60        1.55      1.77         1.75
  16     2.49        2.37       2.32        2.65      3.29         3.07
  32     3.96        4.49       3.83        4.14      5.06         5.70
  64     8.02        7.51       7.14        7.34      9.37        10.02
 128    14.47       14.19      13.31       13.99     18.09        18.86
 256    24.76       27.63      25.59       27.87     34.15        37.19
 512    50.36       55.92      52.56       54.76     69.34        72.55
1024   102.94      110.59     104.97      110.41    136.89       146.54

Patched:

parts H-prune    H-noprune   L-prune   L-noprune   R-prune   R-noprune
   8     1.49         0.90      0.87        0.74      0.84        1.09
  16     2.01         1.50      1.42        1.68      1.42        1.41
  32     2.63         2.47      2.08        2.69      2.73        2.81
  64     5.62         4.66      4.45        4.96      4.92        5.08
 128    11.28         9.65      9.00        9.60      8.68        9.91
 256    18.36        18.49     17.11       18.39     17.47       18.43
 512    33.88        36.89     34.06       36.52     34.01       37.26
1024    66.40        72.75     66.37       73.40     67.06       67.06

Attached v25 patches.

0001-Modify-bound-comparision-functions-to-accept-mem.patch

This is Ashutosh's patch that he posted on the "advanced partition
matching algorithm for partition-wise join" thread.

0002-Refactor-partition-bound-search-functions.patch

This is similar to 0001.  Whereas 0001 modifies just the comparison
functions, this one modifies the partition bound search functions, because
the main pruning patch uses the search functions.

0003-Add-parttypid-partcollation-partsupfunc-to-Parti.patch

This adds some of the fields to PartitionScheme that were needed by the
main pruning patch.

The above 3 patches do what they do, because we'd like the main pruning to
patch to add its functionality by relying on whatever information is made
available in the partitioned table's RelOptInfo.

0004-Faster-partition-pruning.patch

The main patch that adds src/backend/optimizer/util/partprune.c, a module
to provide the functionality that will replace the current approach of
calling relation_excluded_by_constraints() for each partition.

Sorry, but there is still this big TODO here, which I'll try to fix early
next week.

+ * partprune.c
+ *      Provides functions to prune partitions of a partitioned table by
+ *      comparing provided set of clauses with the table's partitions'
+ *      boundaries
+ *
+ * TODO: write a longer description of things in this file

0005-Add-only-unpruned-partitioned-child-rels-to-part.patch

This one teaches the planner to put *only* the un-pruned partitioned child
tables into partitioned_rels list of certain plan nodes.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFjFpRctst136uN2BvbWLAkS7w278XmKY4_PUB%2Bxk-%2BNezNq8g%40mail.gmail.com

Вложения

reorganizing partitioning code (was: Re: [HACKERS] path toward fasterpartition pruning)

От

Amit Langote

Дата:

13 февраля 2018 г., 14:47:43

On 2018/02/09 2:58, Alvaro Herrera wrote:
> Robert Haas wrote:
>> On Wed, Feb 7, 2018 at 3:42 AM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
> 
>>> partition.c seems to have two kinds of functions 1. that build and
>>> manage relcache, creates quals from bounds etc. which are metadata
>>> management kind 2. partition bound comparison functions, and other
>>> optimizer related functions. May be we should divide the file that
>>> way. The first category code remains in catalog/ as it is today. The
>>> second catagory functions move to optimizer/.
>>
>> It would be sensible to separate functions that build and manage data
>> in the relcache from other functions.  I think we should consider
>> moving the existing functions of that type from partition.c to
>> src/backend/utils/cache/partcache.c.
> 
> FWIW I've been thinking that perhaps we need some other separation of
> code better than statu quo.  The current partition.c file includes stuff
> for several modules and ISTM all these new patches are making more and
> more of a mess.  So +1 to the general idea of splitting things up.
> Maybe partcache.c is not ambitious enough, but it seems a good first
> step.

Agree with the proposed reorganizing and adding a partcache.c, which I
tried to do in the attached patch.

* The new src/backend/utils/cache/partcache.c contains functions that
initialize relcache's partitioning related fields.  Various partition
bound comparison and search functions (and then some) that work off of the
cached information are moved.  Also, since we cache partition qual,
interface functions RelationGetPartitioQual(Relation) and
get_partition_qual_relid(Oid) are moved too.

* The new src/include/utils/partcache.h contains various struct
definitions that are moved from backend/catalog/partition.c,
include/catalog/partition.h, and include/utils/rel.h.  Also, declarations
of interface functions of partcache.c.

Thoughts?

Thanks,
Amit

Вложения

v1-0001-Reorganize-partitioning-code.patch

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

13 февраля 2018 г., 17:08:59

On 2018/02/09 21:36, Amit Langote wrote:
> 0004-Faster-partition-pruning.patch
> 
> The main patch that adds src/backend/optimizer/util/partprune.c, a module
> to provide the functionality that will replace the current approach of
> calling relation_excluded_by_constraints() for each partition.
> 
> Sorry, but there is still this big TODO here, which I'll try to fix early
> next week.
> 
> + * partprune.c
> + *      Provides functions to prune partitions of a partitioned table by
> + *      comparing provided set of clauses with the table's partitions'
> + *      boundaries
> + *
> + * TODO: write a longer description of things in this file

And I tried to fix that to some degree in the attached updated version.

Thanks,
Amit

On 2018/02/13 20:08, Amit Langote wrote:
> On 2018/02/09 21:36, Amit Langote wrote:
>> 0004-Faster-partition-pruning.patch
>>
>> The main patch that adds src/backend/optimizer/util/partprune.c, a module
>> to provide the functionality that will replace the current approach of
>> calling relation_excluded_by_constraints() for each partition.
>>
>> Sorry, but there is still this big TODO here, which I'll try to fix early
>> next week.
>>
>> + * partprune.c
>> + *      Provides functions to prune partitions of a partitioned table by
>> + *      comparing provided set of clauses with the table's partitions'
>> + *      boundaries
>> + *
>> + * TODO: write a longer description of things in this file
> 
> And I tried to fix that to some degree in the attached updated version.

Here is an updated version.

I realized that 0005 (Add only unpruned partitioned child rels to
partitioned_rels) did that only for (Merge)Append.  That is, it didn't
handle ModifyTable.  I fixed that by teaching inheritance_planner() to do
it.  In the process, I found out that we don't need the
PartitionedChildRelInfo node and related code anymore, so the patch ended
up removing more code than adding it.

Thanks,
Amit

Вложения

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

17 февраля 2018 г., 15:24:39

On 2 February 2018 at 23:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> 2. PartitionClauseInfo->keyclauses is a list of PartClause which is
>> not a node type. This will cause _copyPartitionClauseInfo() to fail.
>>
>> I'm still not quite sure the best way to fix #2 since PartClause
>> contains a FmgrInfo. I do have a local fix which moves PartClause to
>> primnodes.h and makes it a proper node type. I also added a copy
>> function which does not copy any of the cache fields in PartClause. It
>> just sets valid_cache to false. I didn't particularly think this was
>> the correct fix. I just couldn't think of how exactly this should be
>> done at the time.
>>
>> The attached patch also adds the missing nodetag from
>> PartitionClauseInfo and also fixes up other code so as we don't memset
>> the node memory to zero, as that would destroy the tag. I ended up
>> just having extract_partition_key_clauses do the makeNode call. This
>> also resulted in populate_partition_clauses being renamed to
>> generate_partition_clauses
>
> I started wondering if it's not such a good idea to make
> PartitionClauseInfo a Node at all?  I went back to your earlier message
> [1] where you said that it's put into the Append node for run-time pruning
> to use, but it doesn't sound nice that we'd be putting into the plan
> something that's looks more like scratchpad for the partition.c code.  I
> think we should try to keep PartitionClauseInfo in partition.h and put
> only the list of matched bare clauses into Append.

That sounds like a good idea.

A patch which puts this back is attached.

I've changed the run-time prune patch to process the clause lists
during execution instead.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

make_PartitionClauseInfo_a_nonnode_type.patch

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

17 февраля 2018 г., 15:39:40

            On 15 February 2018 at 18:57, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Here is an updated version.

Thanks for sending v27. I've had a quick look over it while I was
working on the run-time prune patch. However, I've not quite managed a
complete pass of this version yet

A couple of things so far:

1. Following loop;

for (i = 0; i < partnatts; i++)
{
    if (bms_is_member(i, keys->keyisnull))
    {
        /* Only the default partition accepts nulls. */
        if (partition_bound_has_default(boundinfo))
            return bms_make_singleton(boundinfo->default_index);
        else
            return NULL;
    }
}

could become:

if (partition_bound_has_default(boundinfo) &&
   !bms_is_empty(keys->keyisnull)
    return bms_make_singleton(boundinfo->default_index);
else
    return NULL;

2. Is the following form of loop necessary?

for (i = 0; i < partnatts; i++)
{
    if (bms_is_member(i, keys->keyisnull))
    {
        keys->n_eqkeys++;
        keyisnull[i] = true;
    }
}

Can't this just be:

i = -1;
while ((i = bms_next_member(keys->keyisnull, i)) >= 0)
{
    keys->n_eqkeys++;
    keyisnull[i] = true;
}

Perhaps you can just Assert(i < partnatts), if you're worried about that.

Similar code exists in get_partitions_for_keys_range

3. Several comments mention partition_bound_bsearch, but there is now
no such function.

4. "us" should be "is"

* not be any unassigned range to speak of, because the range us unbounded

5. The following code is more complex than it needs to be:

/*
* Since partition keys with nulls are mapped to the default range
* partition, we must include the default partition if some keys
* could be null.
*/
if (keys->n_minkeys < partnatts || keys->n_maxkeys < partnatts)
{
    for (i = 0; i < partnatts; i++)
    {
        if (!bms_is_member(i, keys->keyisnotnull))
        {
            include_def = true;
            break;
        }
    }
}


Instead of the for loop, couldn't you just write:

include_def = (bms_num_members(keys->keyisnotnull) < partnatts);

6. The following comment is not well written:

 * get_partitions_excluded_by_ne_datums
 *
 * Returns a Bitmapset of indexes of partitions that can safely be removed
 * due to each such partition's every allowable non-null datum appearing in
 * a <> opeartor clause.

Maybe it would be better to write:

 * get_partitions_excluded_by_ne_datums
 *
 * Returns a Bitmapset of partition indexes that can safely be removed due to
 * the discovery of <> clauses for each datum value allowed in the partition.

if not, then "opeartor" needs the spelling fixed.

7. "The following"

 * Followig entry points exist to this module.

Are there any other .c files where we comment on all the extern
functions in this way? I don't recall seeing it before.

8. The following may as well just: context.partnatts = partnatts;

context.partnatts = rel->part_scheme->partnatts;


9. Why palloc0? Wouldn't palloc be ok?

context.partkeys = (Expr **) palloc0(sizeof(Expr *) *
context.partnatts);

Also, no need for context.partnatts, just partnatts should be fine.

10. I'd rather see bms_next_member() used here:

/* Add selected partitions' RT indexes to result. */
while ((i = bms_first_member(partindexes)) >= 0)
    result = bms_add_member(result, rel->part_rels[i]->relid);

There's not really much use for bms_first_member these days. It can be
slower due to having to traverse the unset lower significant bits each
loop. bms_next_member starts where the previous loop left off.

Will try to review more tomorrow.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

18 февраля 2018 г., 04:54:30

On 17 February 2018 at 22:24, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 2 February 2018 at 23:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I started wondering if it's not such a good idea to make
>> PartitionClauseInfo a Node at all?
> That sounds like a good idea.
>
> A patch which puts this back is attached.

Please find attached an updated patch. The previous one must've got a
bit mangled in a bad merge.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

make_PartitionClauseInfo_a_nonnode_type_v2.patch

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

18 февраля 2018 г., 08:25:47

 On 17 February 2018 at 22:39, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> 10. I'd rather see bms_next_member() used here:
>
> /* Add selected partitions' RT indexes to result. */
> while ((i = bms_first_member(partindexes)) >= 0)
>     result = bms_add_member(result, rel->part_rels[i]->relid);
>
> There's not really much use for bms_first_member these days. It can be
> slower due to having to traverse the unset lower significant bits each
> loop. bms_next_member starts where the previous loop left off.
>
> Will try to review more tomorrow.

As I mentioned yesterday, here's the remainder of the review:

11. The following comment is misleading. It says: "We currently handle
two such cases:", then it goes on to say the 2nd case is not handled.

/*
 * Handle cases where the clause's operator does not belong to
 * the partitioning operator family.  We currently handle two
 * such cases: 1. Operators named '<>' are not listed in any
 * operator family whatsoever, 2.  Ordering operators like '<'
 * are not listed in the hash operator families.  For 1, check
 * if list partitioning is in use and if so, proceed to pass
 * the clause to the caller without doing any more processing
 * ourselves.  2 cannot be handled at all, so the clause is
 * simply skipped.
 */

12. The following code should test for LIST partitioning before doing
anything else:

if (!op_in_opfamily(opclause->opno, partopfamily))
{
    Oid negator;

    /*
     * To confirm if the operator is really '<>', check if its
     * negator is a equality operator.  If it's a btree
     * equality operator *and* this is a list partitioned
     * table, we can use it prune partitions.
     */
    negator = get_negator(opclause->opno);
    if (OidIsValid(negator) &&
        op_in_opfamily(negator, partopfamily))
    {
        Oid lefttype;
        Oid righttype;
        int strategy;

        get_op_opfamily_properties(negator, partopfamily,
            false,
           &strategy,
           &lefttype, &righttype);

        if (strategy == BTEqualStrategyNumber &&
            context->strategy == PARTITION_STRATEGY_LIST)
            is_ne_listp = true;
    }

    /* Cannot handle this clause. */
    if (!is_ne_listp)
        continue;
}

The code should probably be in the form of:

if (!op_in_opfamily(opclause->opno, partopfamily))
{
    if (context->strategy != PARTITION_STRATEGY_LIST)
        continue;

    ...

    if (strategy == BTEqualStrategyNumber)
        is_ne_listp = true;
}

that way we'll save 3 syscache lookups when a <> clause appears in a
RANGE or HASH partitioned table.

13. The following code makes assumptions that the partitioning op
family is btree:

/*
 * In case of NOT IN (..), we get a '<>', which while not
 * listed as part of any operator family, we are able to
 * handle it if its negator is an equality operator that
 * is in turn part of the operator family.
 */
if (!op_in_opfamily(saop_op, partopfamily))
{
    Oid negator = get_negator(saop_op);
    int strategy;
    Oid lefttype,
    righttype;

    if (!OidIsValid(negator))
        continue;
    get_op_opfamily_properties(negator, partopfamily, false,
      &strategy,
       &lefttype, &righttype);
    if (strategy != BTEqualStrategyNumber)
        continue;
}

this might not be breakable today, but it could well break in the
future, for example, if hash op family managed to grow two more
strategies, then we could get a false match on the matching strategy
numbers (both 3).

14. The following code assumes there will be a right op:

if (IsA(clause, OpExpr))
{
    OpExpr    *opclause = (OpExpr *) clause;
    Expr    *leftop,
      *rightop,
      *valueexpr;
    bool is_ne_listp = false;

    leftop = (Expr *) get_leftop(clause);
    if (IsA(leftop, RelabelType))
        leftop = ((RelabelType *) leftop)->arg;
    rightop = (Expr *) get_rightop(clause);

This'll crash with the following:

create function nonzero(p int) returns bool as $$ begin return (p <>
0); end;$$ language plpgsql;
create operator # (procedure = nonzero, leftarg = int);
create table listp (a int) partition by list (a);
create table listp1 partition of listp for values in(1);
select * from listp where a#;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

You need to ensure that there are 2 args.

15. Various if tests in extract_partition_clauses result in a
`continue` when they should perhaps be a `break` instead.

For example:

if (!PartCollMatchesExprColl(partcoll, opclause->inputcollid))
    continue;

This clause does not depend on the partition key, so there's no point
in trying to match this again for the next partition key.

This item is perhaps not so important, as it's only a small
inefficiency, but I just wanted to point it out. Also, note that plain
Var partition keys cannot be duplicated, but expressions can, so there
may be cases that you don't want to change to `break`

Other conditions which possibly should change to `break` instead of
`continue` include:

/* Only IS [NOT] TRUE/FALSE are any good to us */
if (btest->booltesttype == IS_UNKNOWN ||
btest->booltesttype == IS_NOT_UNKNOWN)
continue;

16. The following code looks a bit fragile:

leftop = IsA(clause, Var)
? (Expr *) clause
: (Expr *) get_notclausearg((Expr *) clause);

This appears to assume that the partition key will be a plain Var and
not an expression. I tried to break this with:

create table bp (a bool, b bool) partition by ((a < b));
create table bp_true partition of bp for values in('t');
explain select * from bp where (a < b);

however, naturally, the parser builds an OpExpr instead of a
BooleanTest for this case. If it had built a BooleanTest, then the
above code would mistakenly call get_notclausearg on the (a < b) Expr.

Do you have reason to believe that the code is safe and a good idea?

17. Which relation is the comment talking about?

/*
 * get_partitions_from_args
 *
 * Returns the set of partitions of relation, each of which satisfies some
 * clause in or_args.
 */
static Bitmapset *
get_partitions_from_or_args(PartitionPruneContext *context, List *or_args)

18. "sets a field", would it not be better to mention constfalse?:

 * returns right after finding such a clause and before returning, sets a field
 * in context->clauseinfo to inform the caller that we found such clause.

19. "clauses"

* partitioning, we don't require all of eqkeys to be operator clausses.

20. There does not seem to be a need to palloc0 here. palloc seems fine.

keys->ne_datums = (Datum *)
palloc0(list_length(clauseinfo->ne_clauses) *
sizeof(Datum));

This, of course, may leave unset memory in any unused items, but you
never iterate beyond what n_ne_datums gets set to anyway, so I don't
see the need to zero any extra elements.

21. A code comment should be added to the following code to mention
that these are not arrays indexed by partition key as they're only
ever used for LIST partitioning, which only supports a single key.

/* Datum values from clauses containing <> operator */
Datum  *ne_datums;
int n_ne_datums;

22. Can you include: "'keyisnotnull' may also be set for the given
partition key when a strict OpExpr is encountered" in the following
comment?

 * Based on a IS NULL or IS NOT NULL clause that was matched to a partition
 * key, the corresponding bit in 'keyisnull' or 'keyisnotnull' is set.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

19 февраля 2018 г., 10:59:28

Hi David.

On 2018/02/17 18:24, David Rowley wrote:
> On 2 February 2018 at 23:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I started wondering if it's not such a good idea to make
>> PartitionClauseInfo a Node at all?  I went back to your earlier message
>> [1] where you said that it's put into the Append node for run-time pruning
>> to use, but it doesn't sound nice that we'd be putting into the plan
>> something that's looks more like scratchpad for the partition.c code.  I
>> think we should try to keep PartitionClauseInfo in partition.h and put
>> only the list of matched bare clauses into Append.
> 
> That sounds like a good idea.
> 
> A patch which puts this back is attached.
> 
> I've changed the run-time prune patch to process the clause lists
> during execution instead.

Thank you.  I'll incorporate it in the version I'll send next.

Regards,
Amit

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

19 февраля 2018 г., 15:19:29

Hi David.

Thanks a lot for the review comments.  Replying to all of your comments.

On 2018/02/17 18:39, David Rowley wrote:
> On 15 February 2018 at 18:57, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Here is an updated version.
> 
> Thanks for sending v27. I've had a quick look over it while I was
> working on the run-time prune patch. However, I've not quite managed a
> complete pass of this version yet
> 
> A couple of things so far:
> 
> 1. Following loop;
> 
> for (i = 0; i < partnatts; i++)
> {
>     if (bms_is_member(i, keys->keyisnull))
>     {
>         /* Only the default partition accepts nulls. */
>         if (partition_bound_has_default(boundinfo))
>             return bms_make_singleton(boundinfo->default_index);
>         else
>             return NULL;
>     }
> }
> 
> could become:
> 
> if (partition_bound_has_default(boundinfo) &&
>    !bms_is_empty(keys->keyisnull)
>     return bms_make_singleton(boundinfo->default_index);
> else
>     return NULL;
> 
> 2. Is the following form of loop necessary?
> 
> for (i = 0; i < partnatts; i++)
> {
>     if (bms_is_member(i, keys->keyisnull))
>     {
>         keys->n_eqkeys++;
>         keyisnull[i] = true;
>     }
> }
> 
> Can't this just be:
> 
> i = -1;
> while ((i = bms_next_member(keys->keyisnull, i)) >= 0)
> {
>     keys->n_eqkeys++;
>     keyisnull[i] = true;
> }
> 
> Perhaps you can just Assert(i < partnatts), if you're worried about that.
> 
> Similar code exists in get_partitions_for_keys_range

Both 1 and 2 are good suggestions, so done that way.

> 
> 3. Several comments mention partition_bound_bsearch, but there is now
> no such function.
> 
> 4. "us" should be "is"
> 
> * not be any unassigned range to speak of, because the range us unbounded

Fixed.

> 5. The following code is more complex than it needs to be:
> 
> /*
> * Since partition keys with nulls are mapped to the default range
> * partition, we must include the default partition if some keys
> * could be null.
> */
> if (keys->n_minkeys < partnatts || keys->n_maxkeys < partnatts)
> {
>     for (i = 0; i < partnatts; i++)
>     {
>         if (!bms_is_member(i, keys->keyisnotnull))
>         {
>             include_def = true;
>             break;
>         }
>     }
> }
> 
> 
> Instead of the for loop, couldn't you just write:
> 
> include_def = (bms_num_members(keys->keyisnotnull) < partnatts);

Indeed, it's much simpler.  Though, I wrote it as:

+    if (bms_num_members(keys->keyisnotnull) < partnatts)
+        include_def = true;

> 6. The following comment is not well written:
> 
>  * get_partitions_excluded_by_ne_datums
>  *
>  * Returns a Bitmapset of indexes of partitions that can safely be removed
>  * due to each such partition's every allowable non-null datum appearing in
>  * a <> opeartor clause.
> 
> Maybe it would be better to write:
> 
>  * get_partitions_excluded_by_ne_datums
>  *
>  * Returns a Bitmapset of partition indexes that can safely be removed due to
>  * the discovery of <> clauses for each datum value allowed in the partition.
> 
> if not, then "opeartor" needs the spelling fixed.

Sure, your rewrite sounds much better.

> 7. "The following"
> 
>  * Followig entry points exist to this module.

Fixed.

> 
> Are there any other .c files where we comment on all the extern
> functions in this way? I don't recall seeing it before.

Hmm, not like the way this patch does, but some files in the executor do
have such introductory comments using description of functions exported by
the module.  See for example, nodeSeqscan.c, execMain.c, etc.

Not saying that this is the best way to introduce the module, but this is
the one I went with for now.  If this format is not very informative, I'm
willing to rewrite it some other way.

> 8. The following may as well just: context.partnatts = partnatts;
> 
> context.partnatts = rel->part_scheme->partnatts;
>
> 9. Why palloc0? Wouldn't palloc be ok?
> 
> context.partkeys = (Expr **) palloc0(sizeof(Expr *) *
> context.partnatts);
> 
> Also, no need for context.partnatts, just partnatts should be fine.

Fixed both.  Yes, palloc suffices.

> 10. I'd rather see bms_next_member() used here:
> 
> /* Add selected partitions' RT indexes to result. */
> while ((i = bms_first_member(partindexes)) >= 0)
>     result = bms_add_member(result, rel->part_rels[i]->relid);
> 
> There's not really much use for bms_first_member these days. It can be
> slower due to having to traverse the unset lower significant bits each
> loop. bms_next_member starts where the previous loop left off.

Thanks for clarifying.  Used bms_next_member().

> Will try to review more tomorrow.
>
On 2018/02/18 11:25, David Rowley wrote:
> As I mentioned yesterday, here's the remainder of the review:
> 
> 11. The following comment is misleading. It says: "We currently handle
> two such cases:", then it goes on to say the 2nd case is not handled.
> 
> /*
>  * Handle cases where the clause's operator does not belong to
>  * the partitioning operator family.  We currently handle two
>  * such cases: 1. Operators named '<>' are not listed in any
>  * operator family whatsoever, 2.  Ordering operators like '<'
>  * are not listed in the hash operator families.  For 1, check
>  * if list partitioning is in use and if so, proceed to pass
>  * the clause to the caller without doing any more processing
>  * ourselves.  2 cannot be handled at all, so the clause is
>  * simply skipped.
>  */

You're right.  We don't really "handle" 2.  Rewrote the comment like this:

+             * Normally we only bother with operators that are listed as
+             * being part of the partitioning operator family.  But we
+             * make an exception in one case -- operators named '<>' are
+             * not listed in any operator family whatsoever, in which
+             * case, we try to perform partition pruning with it only if
+             * list partitioning is in use.

> 
> 12. The following code should test for LIST partitioning before doing
> anything else:
> 
> if (!op_in_opfamily(opclause->opno, partopfamily))
> {
>     Oid negator;
> 
>     /*
>      * To confirm if the operator is really '<>', check if its
>      * negator is a equality operator.  If it's a btree
>      * equality operator *and* this is a list partitioned
>      * table, we can use it prune partitions.
>      */
>     negator = get_negator(opclause->opno);
>     if (OidIsValid(negator) &&
>         op_in_opfamily(negator, partopfamily))
>     {
>         Oid lefttype;
>         Oid righttype;
>         int strategy;
> 
>         get_op_opfamily_properties(negator, partopfamily,
>             false,
>            &strategy,
>            &lefttype, &righttype);
> 
>         if (strategy == BTEqualStrategyNumber &&
>             context->strategy == PARTITION_STRATEGY_LIST)
>             is_ne_listp = true;
>     }
> 
>     /* Cannot handle this clause. */
>     if (!is_ne_listp)
>         continue;
> }
> 
> The code should probably be in the form of:
> 
> if (!op_in_opfamily(opclause->opno, partopfamily))
> {
>     if (context->strategy != PARTITION_STRATEGY_LIST)
>         continue;
> 
>     ...
> 
>     if (strategy == BTEqualStrategyNumber)
>         is_ne_listp = true;
> }
> 
> that way we'll save 3 syscache lookups when a <> clause appears in a
> RANGE or HASH partitioned table.

Good idea, done.

> 13. The following code makes assumptions that the partitioning op
> family is btree:
> 
> /*
>  * In case of NOT IN (..), we get a '<>', which while not
>  * listed as part of any operator family, we are able to
>  * handle it if its negator is an equality operator that
>  * is in turn part of the operator family.
>  */
> if (!op_in_opfamily(saop_op, partopfamily))
> {
>     Oid negator = get_negator(saop_op);
>     int strategy;
>     Oid lefttype,
>     righttype;
> 
>     if (!OidIsValid(negator))
>         continue;
>     get_op_opfamily_properties(negator, partopfamily, false,
>       &strategy,
>        &lefttype, &righttype);
>     if (strategy != BTEqualStrategyNumber)
>         continue;
> }
> 
> this might not be breakable today, but it could well break in the
> future, for example, if hash op family managed to grow two more
> strategies, then we could get a false match on the matching strategy
> numbers (both 3).

I guess we'd be able to anything useful with a NOT IN (..) only with list
partitioning, so I changed the code and surrounding comments like I did
per your comment above.  So, if we process it only if list partitioning is
used, the existing code can safely assume that partitioning operator
family is btree.

> 14. The following code assumes there will be a right op:
> 
> if (IsA(clause, OpExpr))
> {
>     OpExpr    *opclause = (OpExpr *) clause;
>     Expr    *leftop,
>       *rightop,
>       *valueexpr;
>     bool is_ne_listp = false;
> 
>     leftop = (Expr *) get_leftop(clause);
>     if (IsA(leftop, RelabelType))
>         leftop = ((RelabelType *) leftop)->arg;
>     rightop = (Expr *) get_rightop(clause);
> 
> This'll crash with the following:
> 
> create function nonzero(p int) returns bool as $$ begin return (p <>
> 0); end;$$ language plpgsql;
> create operator # (procedure = nonzero, leftarg = int);
> create table listp (a int) partition by list (a);
> create table listp1 partition of listp for values in(1);
> select * from listp where a#;
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> 
> You need to ensure that there are 2 args.

Oops, so we can't prune with unary operator clause.  Added the check on
number of args.

> 15. Various if tests in extract_partition_clauses result in a
> `continue` when they should perhaps be a `break` instead.
> 
> For example:
> 
> if (!PartCollMatchesExprColl(partcoll, opclause->inputcollid))
>     continue;
> 
> This clause does not depend on the partition key, so there's no point
> in trying to match this again for the next partition key.
> 
> This item is perhaps not so important, as it's only a small
> inefficiency, but I just wanted to point it out. Also, note that plain
> Var partition keys cannot be duplicated, but expressions can, so there
> may be cases that you don't want to change to `break`
> 
> Other conditions which possibly should change to `break` instead of
> `continue` include:
> 
> /* Only IS [NOT] TRUE/FALSE are any good to us */
> if (btest->booltesttype == IS_UNKNOWN ||
> btest->booltesttype == IS_NOT_UNKNOWN)
> continue;

You're quite right.  Replaced all those continue's with break.

> 16. The following code looks a bit fragile:
> 
> leftop = IsA(clause, Var)
> ? (Expr *) clause
> : (Expr *) get_notclausearg((Expr *) clause);
> 
> This appears to assume that the partition key will be a plain Var and
> not an expression. I tried to break this with:
> 
> create table bp (a bool, b bool) partition by ((a < b));
> create table bp_true partition of bp for values in('t');
> explain select * from bp where (a < b);
> 
> however, naturally, the parser builds an OpExpr instead of a
> BooleanTest for this case. If it had built a BooleanTest, then the
> above code would mistakenly call get_notclausearg on the (a < b) Expr.
> 
> Do you have reason to believe that the code is safe and a good idea?

Thanks for the example.  I think there were problems here.  The way the
current code matched these specially shaped clauses with a Boolean
partition key wouldn't have resulted in correct pruning for your example
and perhaps more cases.

I looked at match_boolean_index_clause() and decided that we needed
something similar here.  So, in the updated patch, I've added a
match_boolean_partition_clause(), which will be called at the beginning of
the loop to recognize such specially shaped clauses as applicable to a
Boolean partition key and add them to the set of pruning clauses.  With
the new code:

create table bp (a int, b int) partition by list ((a < b));
create table bp_true partition of bp for values in ('true');
create table bp_false partition of bp for values in ('false');

explain select * from bp where (a < b) is true;
                           QUERY PLAN
----------------------------------------------------------------
 Append  (cost=0.00..38.25 rows=753 width=8)
   ->  Seq Scan on bp_true  (cost=0.00..38.25 rows=753 width=8)
         Filter: ((a < b) IS TRUE)
(3 rows)

explain select * from bp where (a < b) is false;
                            QUERY PLAN
------------------------------------------------------------------
 Append  (cost=0.00..38.25 rows=1507 width=8)
   ->  Seq Scan on bp_false  (cost=0.00..38.25 rows=1507 width=8)
         Filter: ((a < b) IS FALSE)
(3 rows)

explain select * from bp where (a < b) is not false;
                           QUERY PLAN
----------------------------------------------------------------
 Append  (cost=0.00..38.25 rows=753 width=8)
   ->  Seq Scan on bp_true  (cost=0.00..38.25 rows=753 width=8)
         Filter: ((a < b) IS NOT FALSE)
(3 rows)

explain select * from bp where (a < b) = true;
                           QUERY PLAN
----------------------------------------------------------------
 Append  (cost=0.00..38.25 rows=753 width=8)
   ->  Seq Scan on bp_true  (cost=0.00..38.25 rows=753 width=8)
         Filter: (a < b)
(3 rows)

explain select * from bp where (a < b) = false;
                           QUERY PLAN
-----------------------------------------------------------------
 Append  (cost=0.00..38.25 rows=753 width=8)
   ->  Seq Scan on bp_false  (cost=0.00..38.25 rows=753 width=8)
         Filter: (a >= b)
(3 rows)

explain select * from bp where (a < b);
                           QUERY PLAN
----------------------------------------------------------------
 Append  (cost=0.00..38.25 rows=753 width=8)
   ->  Seq Scan on bp_true  (cost=0.00..38.25 rows=753 width=8)
         Filter: (a < b)
(3 rows)

explain select * from bp where not (a < b);
                           QUERY PLAN
-----------------------------------------------------------------
 Append  (cost=0.00..38.25 rows=753 width=8)
   ->  Seq Scan on bp_false  (cost=0.00..38.25 rows=753 width=8)
         Filter: (a >= b)
(3 rows)

> 17. Which relation is the comment talking about?
> 
> /*
>  * get_partitions_from_args
>  *
>  * Returns the set of partitions of relation, each of which satisfies some
>  * clause in or_args.
>  */
> static Bitmapset *
> get_partitions_from_or_args(PartitionPruneContext *context, List *or_args)

Fixed the comment.

> 18. "sets a field", would it not be better to mention constfalse?:
> 
>  * returns right after finding such a clause and before returning, sets a field
>  * in context->clauseinfo to inform the caller that we found such clause.

You're right.

> 19. "clauses"
> 
> * partitioning, we don't require all of eqkeys to be operator clausses.

Fixed.

> 20. There does not seem to be a need to palloc0 here. palloc seems fine.
> 
> keys->ne_datums = (Datum *)
> palloc0(list_length(clauseinfo->ne_clauses) *
> sizeof(Datum));
>
> This, of course, may leave unset memory in any unused items, but you
> never iterate beyond what n_ne_datums gets set to anyway, so I don't
> see the need to zero any extra elements.

Agreed about using palloc.

> 21. A code comment should be added to the following code to mention
> that these are not arrays indexed by partition key as they're only
> ever used for LIST partitioning, which only supports a single key.
> 
> /* Datum values from clauses containing <> operator */
> Datum  *ne_datums;
> int n_ne_datums;

OK, done.  Also adjusted the comment above the struct's definition.

> 22. Can you include: "'keyisnotnull' may also be set for the given
> partition key when a strict OpExpr is encountered" in the following
> comment?
> 
>  * Based on a IS NULL or IS NOT NULL clause that was matched to a partition
>  * key, the corresponding bit in 'keyisnull' or 'keyisnotnull' is set.

Done.

Attached updated patches.  Thanks again!

Regards,
Amit

Вложения

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

19 февраля 2018 г., 19:40:46

On 19 February 2018 at 22:19, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated patches.  Thanks again!

Thanks for making those changes.  I've made another pass of v28 and
have a few more comments.

The patch is starting to look good, but there are some new changes in
recent versions which still don't look quite right.

1. This does not fully make sense:

/*
 * Remove the indexes of partitions excluded due to each of
 * those partitions' *all* of allowed datums appearing in
 * keys->ne_datums, that is compared to the partition key
 * using <> operator.
 */

 "each of those partitions' *all* of allowed" is not correct.

Maybe better to write:

/*
 * Remove the indexes of any partitions which cannot possibly
 * contain rows matching the clauses due to key->ne_datums containing
 * all datum values which are allowed in the given partition.  This
 * is only possible to do in LIST partitioning as it's the only
 * partitioning strategy which allows the specification of exact values.
 */

2. Mine does not, but some compilers may complain about
get_partitions_for_keys() result variable being uninitalised in
get_partitions_for_keys. Probably the easiest fix would be to just
assign to NULL in the default case.

3. Did you mean to put this Assert() inside the loop?

memset(keyisnull, false, sizeof(keyisnull));
i = -1;
while ((i = bms_next_member(keys->keyisnull, i)) >= 0)
{
keys->n_eqkeys++;
keyisnull[i] = true;
}
Assert(i < partnatts);

i will always be -2 at the end of the loop. Seems like a useless
Assert in its current location.

4. Can you add a comment here to say: "Note: LIST partitioning only
supports a single partition key, therefore this function requires no
looping over the partition keys"

/*
 * get_partitions_for_keys_list
 * Return partitions of a list partitioned table for requested keys
 *
 * This interprets the keys and looks up partitions in the partition bound
 * descriptor using the list partitioning semantics.
 */

5. The following comment contains a bit of duplication to the comment
which comes after it. Maybe the following:

/*
* If the query is looking for null keys, there can only be one such
* partition.  Return the same if one exists.
*/

can be changed to:

/* Handle clauses requesting a NULL valued partition key */

6. The following comment does not quite make sense:

/* Exactly matching datum exists. */

Probably better to write:

/* An exact matching datum exists. */

7. "If it's not equal (<)" I think you mean (>), not (<), in:

 * The bound at minoff is <= minkeys, given the way
 * partition_list_bsearch() works.  If it's not equal (<), then
 * increment minoff to make it point to the datum on the right
 * that necessarily satisfies minkeys.  Also do the same if it is
 * equal but minkeys is exclusive.

However, the comment is a bit clumsy. Maybe the following is better?

/*
 * partition_list_bsearch returning a positive number means that
 * minkeys[0] must be greater than or equal to the smallest datum.
 * If we didn't find an exact matching datum (!is_equal) or if the
 * operator used was non-inclusive (>), then in both of these
 * cases we're not interested in the datum pointed to by minoff,
 * but we may start getting matches in the partition which the
 * next datum belongs to, so point to that one instead.  (This may
 * be beyond the last datum in the array, but we'll detect that
 * later.)
 */

8. The following comment could be improved:

* minkeys is greater than the datums of all non-default partitions,
* meaning there isn't one to return.  Return the default partition if
* one exists.

how about:

* The value of minkeys[0] is greater than all of the datums we have
* partitions for.  The only possible partition that could contain a
* match is the default partition.  Return that, if it exists.

9. The following could also be improved:

* The bound at maxoff is <= maxkeys, given the way
* partition_list_bsearch works.  If the bound at maxoff exactly
* matches maxkey (is_equal), but the maxkey is exclusive, then
* decrement maxoff to point to the bound on the left.

how about:

* partition_list_bsearch returning a positive number means that
* maxkeys[0] must be greater than or equal to the smallest datum.
* If the match found is an equal match, but the operator used is
* non-inclusive of that value (<), then the partition belonging
* to maxoff cannot match, so we'll decrement maxoff to point to
* the partition belonging to the previous datum.  We might end up
* decrementing maxoff down to -1, but we'll handle that later.

10. Can you append "  This may not technically be true for some data
types (e.g. integer types), however, we currently lack any sort of
infrastructure to provide us with proofs that would allow us to do
anything smarter here." to:

* For range queries, always include the default list partition,
* because list partitions divide the key space in a discontinuous
* manner, not all values in the given range will have a partition
* assigned.

11. get_partitions_for_keys_range seems to prefer to do "minoff -= 1",
but get_partitions_for_keys_list likes to "minoff--", can this be made
the same? Personally, I like -- over -= 1 as it's shorter. Although I
do remember having an argument with my university professor about
this. He claimed -= 1 was clearer... I'm still unsure what he found so
confusing about -- ...

12. The following code could be optimised a little for the case when
there's no default:

/*
* There may exist a range of values unassigned to any non-default
* partition between the datums at minoff and maxoff.
*/
for (i = minoff; i <= maxoff; i++)
{
if (boundinfo->indexes[i] < 0)
{
include_def = true;
break;
}
}

/*
* Since partition keys with nulls are mapped to the default range
* partition, we must include the default partition if some keys
* *could* be null.
*/
if (bms_num_members(keys->keyisnotnull) < partnatts)
include_def = true;

if (include_def && partition_bound_has_default(boundinfo))
result = bms_add_member(result, boundinfo->default_index);

return result;

Maybe something more like:

if (!partition_bound_has_default(boundinfo))
    return result;

/*
* There may exist a range of values unassigned to any non-default
* partition between the datums at minoff and maxoff.
*/
for (i = minoff; i <= maxoff; i++)
{
    if (boundinfo->indexes[i] < 0)
        return bms_add_member(result, boundinfo->default_index);
}

/*
* Since partition keys with nulls are mapped to the default range
* partition, we must include the default partition if some keys
* *could* be null.
*/
if (bms_num_members(keys->keyisnotnull) < partnatts)
    return bms_add_member(result, boundinfo->default_index);

Which is saves a bit of needless work when there's no default to add,
and also saves a few lines, including the line where you declare the
include_def variable.

13. Variable name:

Bitmapset  *partitioned_rels_bms = NULL;

This should likely be called partitioned_relids, and be of type Relids
instead of Bitmapset.

14. This line removal seems surplus. It should probably be fixed
independently of this patch.

  parent_roots[appinfo->child_relid] = subroot;
-
  continue;

15. I'm unsure how safe the following code is:

while ((parent_rti = bms_first_member(partitioned_rels_bms)) >= 0)
    partitioned_rels = lappend_int(partitioned_rels, parent_rti);

You're now putting this list in ascending order of relid, but some
code in set_plan_refs assumes the root partition is the first element:

root->glob->rootResultRelations =
lappend_int(root->glob->rootResultRelations,
linitial_int(splan->partitioned_rels));

By luck, the first element might today be the root due to the way we
expand the inheritance hierarchy, but I think this code is wrong to
rely on that.

I'm not really a fan of having the root partition be the first element
in the List. I would much rather see a Relids type and a special Index
field for the root, but that might be more changes that you'd like to
make here. I just don't think what you have now is correct.

16. This should probably return Relids rather than Bitmapset *.

Bitmapset *
prune_append_rel_partitions(PlannerInfo *root, RelOptInfo *rel)

Please update the mention of Bitmapset in the comment at the top of
the function too.

17. hmm, my patch did palloc(), not palloc0(). My original patch was
broken and missed this, but v2 got it.

context->clauseinfo = partclauseinfo =
palloc0(sizeof(PartitionClauseInfo));

There's no need to palloc0() here. You're setting all the fields to
zero just below. As far as I understand it, it's only Node types that
we have to go through the rigmarole of doing both.

18. I tentatively agree with you having changed the continue to break
in the following:

/* We can't use any volatile value to prune partitions. */
if (contain_volatile_functions((Node *) valueexpr))
break;

I believe it's not wrong to break here, but keep in mind you're
testing valueexpr rather than something with the OpExpr itself. The
reason I'm not saying this is wrong is that if the valueexpr is
volatile then it cannot possibly match another partition key anyway,
so there's likely no point in continuing to look for another match...
You should likely write a comment to explain this a bit. I think all
of the other places you've changed to break look fine. The
ScalarArrayOpExpr volatile function test is fine to break from since
the operands cannot be reversed in that case, so rightop certainly
can't match a partition key.

19. I might have caused this, but there's no such variable as 'cur'

/* cur is more restrictive, so replace the existing. */

20. Is there a difference between
partition_bound_has_default(context->boundinfo) and
context->has_default_part? Any reason for both? Your code uses both. I
don't yet understand why has_default_part was added.

That's all I have for now.

It's getting close. Good work!

I keep having to turn up my strictness level each time I review.  I'd
like to be able to turn it up all the way earlier, but I fear I may
still be reviewing v1 if I'd done that :-)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

20 февраля 2018 г., 13:04:23

Hi David.

Thanks for the review.

On 2018/02/19 22:40, David Rowley wrote:
> On 19 February 2018 at 22:19, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached updated patches.  Thanks again!
> 
> Thanks for making those changes.  I've made another pass of v28 and
> have a few more comments.
> 
> The patch is starting to look good, but there are some new changes in
> recent versions which still don't look quite right.
> 
> 1. This does not fully make sense:
> 
> /*
>  * Remove the indexes of partitions excluded due to each of
>  * those partitions' *all* of allowed datums appearing in
>  * keys->ne_datums, that is compared to the partition key
>  * using <> operator.
>  */
> 
>  "each of those partitions' *all* of allowed" is not correct.
> 
> Maybe better to write:
> 
> /*
>  * Remove the indexes of any partitions which cannot possibly
>  * contain rows matching the clauses due to key->ne_datums containing
>  * all datum values which are allowed in the given partition.  This
>  * is only possible to do in LIST partitioning as it's the only
>  * partitioning strategy which allows the specification of exact values.
>  */

Ah, your rewrite sounds much better.

> 2. Mine does not, but some compilers may complain about
> get_partitions_for_keys() result variable being uninitalised in
> get_partitions_for_keys. Probably the easiest fix would be to just
> assign to NULL in the default case.

Done.

> 3. Did you mean to put this Assert() inside the loop?
> 
> memset(keyisnull, false, sizeof(keyisnull));
> i = -1;
> while ((i = bms_next_member(keys->keyisnull, i)) >= 0)
> {
> keys->n_eqkeys++;
> keyisnull[i] = true;
> }
> Assert(i < partnatts);
>
> i will always be -2 at the end of the loop. Seems like a useless
> Assert in its current location.

Face palm!  Fixed.

> 4. Can you add a comment here to say: "Note: LIST partitioning only
> supports a single partition key, therefore this function requires no
> looping over the partition keys"
> 
> /*
>  * get_partitions_for_keys_list
>  * Return partitions of a list partitioned table for requested keys
>  *
>  * This interprets the keys and looks up partitions in the partition bound
>  * descriptor using the list partitioning semantics.
>  */

OK, done.

> 5. The following comment contains a bit of duplication to the comment
> which comes after it. Maybe the following:
> 
> /*
> * If the query is looking for null keys, there can only be one such
> * partition.  Return the same if one exists.
> */
> 
> can be changed to:
> 
> /* Handle clauses requesting a NULL valued partition key */

You're right.  Changed.

> 6. The following comment does not quite make sense:
> 
> /* Exactly matching datum exists. */
> 
> Probably better to write:
> 
> /* An exact matching datum exists. */

OK, done.

> 7. "If it's not equal (<)" I think you mean (>), not (<), in:
> 
>  * The bound at minoff is <= minkeys, given the way
>  * partition_list_bsearch() works.  If it's not equal (<), then
>  * increment minoff to make it point to the datum on the right
>  * that necessarily satisfies minkeys.  Also do the same if it is
>  * equal but minkeys is exclusive.
> 
> However, the comment is a bit clumsy. Maybe the following is better?
> 
> /*
>  * partition_list_bsearch returning a positive number means that
>  * minkeys[0] must be greater than or equal to the smallest datum.
>  * If we didn't find an exact matching datum (!is_equal) or if the
>  * operator used was non-inclusive (>), then in both of these
>  * cases we're not interested in the datum pointed to by minoff,
>  * but we may start getting matches in the partition which the
>  * next datum belongs to, so point to that one instead.  (This may
>  * be beyond the last datum in the array, but we'll detect that
>  * later.)
>  */

Your rewrite is much better.

> 8. The following comment could be improved:
> 
> * minkeys is greater than the datums of all non-default partitions,
> * meaning there isn't one to return.  Return the default partition if
> * one exists.
> 
> how about:
> 
> * The value of minkeys[0] is greater than all of the datums we have
> * partitions for.  The only possible partition that could contain a
> * match is the default partition.  Return that, if it exists.

OK, adopted your rewrite.

> 9. The following could also be improved:
> 
> * The bound at maxoff is <= maxkeys, given the way
> * partition_list_bsearch works.  If the bound at maxoff exactly
> * matches maxkey (is_equal), but the maxkey is exclusive, then
> * decrement maxoff to point to the bound on the left.
> 
> how about:
> 
> * partition_list_bsearch returning a positive number means that
> * maxkeys[0] must be greater than or equal to the smallest datum.
> * If the match found is an equal match, but the operator used is
> * non-inclusive of that value (<), then the partition belonging
> * to maxoff cannot match, so we'll decrement maxoff to point to
> * the partition belonging to the previous datum.  We might end up
> * decrementing maxoff down to -1, but we'll handle that later.

OK, done.

> 10. Can you append "  This may not technically be true for some data
> types (e.g. integer types), however, we currently lack any sort of
> infrastructure to provide us with proofs that would allow us to do
> anything smarter here." to:
> 
> * For range queries, always include the default list partition,
> * because list partitions divide the key space in a discontinuous
> * manner, not all values in the given range will have a partition
> * assigned.

Hmm, that seems to make sense, so added the text.

I guess you know it already, but I'm trying to say in the comment that
list partitioning, by definition, does not force you to enumerate all
values that a data type may specify to exist.

> 11. get_partitions_for_keys_range seems to prefer to do "minoff -= 1",
> but get_partitions_for_keys_list likes to "minoff--", can this be made
> the same? Personally, I like -- over -= 1 as it's shorter. Although I
> do remember having an argument with my university professor about
> this. He claimed -= 1 was clearer... I'm still unsure what he found so
> confusing about -- ...

I will go with minoff-- for consistency as you say.

> 12. The following code could be optimised a little for the case when
> there's no default:
> 
> /*
> * There may exist a range of values unassigned to any non-default
> * partition between the datums at minoff and maxoff.
> */
> for (i = minoff; i <= maxoff; i++)
> {
> if (boundinfo->indexes[i] < 0)
> {
> include_def = true;
> break;
> }
> }
> 
> /*
> * Since partition keys with nulls are mapped to the default range
> * partition, we must include the default partition if some keys
> * *could* be null.
> */
> if (bms_num_members(keys->keyisnotnull) < partnatts)
> include_def = true;
> 
> if (include_def && partition_bound_has_default(boundinfo))
> result = bms_add_member(result, boundinfo->default_index);
> 
> return result;
> 
> Maybe something more like:
> 
> if (!partition_bound_has_default(boundinfo))
>     return result;
> 
> /*
> * There may exist a range of values unassigned to any non-default
> * partition between the datums at minoff and maxoff.
> */
> for (i = minoff; i <= maxoff; i++)
> {
>     if (boundinfo->indexes[i] < 0)
>         return bms_add_member(result, boundinfo->default_index);
> }
> 
> /*
> * Since partition keys with nulls are mapped to the default range
> * partition, we must include the default partition if some keys
> * *could* be null.
> */
> if (bms_num_members(keys->keyisnotnull) < partnatts)
>     return bms_add_member(result, boundinfo->default_index);
> 
> Which is saves a bit of needless work when there's no default to add,
> and also saves a few lines, including the line where you declare the
> include_def variable.

Ah, that's neat.  Done that way.  I also changed other code that sets
include_def to instead add the default partition index to result right
away, instead of just setting include_def.  So, include_def is now gone.

> 13. Variable name:
> 
> Bitmapset  *partitioned_rels_bms = NULL;
> 
> This should likely be called partitioned_relids, and be of type Relids
> instead of Bitmapset.

Makes sense, done.

> 14. This line removal seems surplus. It should probably be fixed
> independently of this patch.
> 
>   parent_roots[appinfo->child_relid] = subroot;
> -
>   continue;

Hadn't noticed that.  Fixed.

> 15. I'm unsure how safe the following code is:
> 
> while ((parent_rti = bms_first_member(partitioned_rels_bms)) >= 0)
>     partitioned_rels = lappend_int(partitioned_rels, parent_rti);
> 
> You're now putting this list in ascending order of relid, but some
> code in set_plan_refs assumes the root partition is the first element:
> 
> root->glob->rootResultRelations =
> lappend_int(root->glob->rootResultRelations,
> linitial_int(splan->partitioned_rels));
> 
> By luck, the first element might today be the root due to the way we
> expand the inheritance hierarchy, but I think this code is wrong to
> rely on that.
> 
> I'm not really a fan of having the root partition be the first element
> in the List. I would much rather see a Relids type and a special Index
> field for the root, but that might be more changes that you'd like to
> make here. I just don't think what you have now is correct.

Hmm, so you're saying that it's not future-proof for the code here to
assume that the root table will always get the smallest RT index of the
tables in a given partition tree.  I guess it is a concern that exists
independently of the changes this patch makes, but I agree with the
concern.  We can write another patch to break that assumption by using the
method you suggest of adding a Index variable to the ModifyTable node to
store the root table index.

> 16. This should probably return Relids rather than Bitmapset *.
> 
> Bitmapset *
> prune_append_rel_partitions(PlannerInfo *root, RelOptInfo *rel)
> 
> Please update the mention of Bitmapset in the comment at the top of
> the function too.

OK, done.

> 17. hmm, my patch did palloc(), not palloc0(). My original patch was
> broken and missed this, but v2 got it.
> 
> context->clauseinfo = partclauseinfo =
> palloc0(sizeof(PartitionClauseInfo));
> 
> There's no need to palloc0() here. You're setting all the fields to
> zero just below. As far as I understand it, it's only Node types that
> we have to go through the rigmarole of doing both.

Ah, you're right.  Used palloc.

> 18. I tentatively agree with you having changed the continue to break
> in the following:
> 
> /* We can't use any volatile value to prune partitions. */
> if (contain_volatile_functions((Node *) valueexpr))
> break;
> 
> I believe it's not wrong to break here, but keep in mind you're
> testing valueexpr rather than something with the OpExpr itself. The
> reason I'm not saying this is wrong is that if the valueexpr is
> volatile then it cannot possibly match another partition key anyway,
> so there's likely no point in continuing to look for another match...
> You should likely write a comment to explain this a bit I think all> of the other places you've changed to break look
fine.The
 
> ScalarArrayOpExpr volatile function test is fine to break from since
> the operands cannot be reversed in that case, so rightop certainly
> can't match a partition key.

When working on this comment, I realized we shouldn't really "break" on
the collation mismatch.  Multiple keys can have the same expression, but
different collation and "break"ing in this case would mean, we'd miss
matching it to another key that has the matching collation.  IOW, order in
which clauses appear in the input list determines if they're matched to
partition keys correctly or not.

See this somewhat made up example:

create table rp (a text) partition by range (substr(a, 1) collate "en_GB",
substr(a, 1) collate "en_US");
create table rp1 partition of rp for values from ('a', 'a') to ('a', 'e');
create table rp2 partition of rp for values from ('a', 'e') to ('a', 'z');
create table rp3 partition of rp for values from ('b', 'a') to ('b', 'e');

For the following query:

select * from rp where substr(a, 1) = 'e' collate "en_US" and substr(a, 1)
= 'a' collate "en_GB";

With the current code, we'll end up discarding (via break) the 1st clause
after its collation (en_US) doesn't match the 1st key's collation (en_GB)
and thus we'll end with only one clause, that is the 2nd one, being added
to matched clauses.  That would result in only p3 being pruned, whereas
both rp1 and rp3 should be pruned.

I've fixed that.  With the new code, both the expression and the collation
should match before we conclude that the clause matched the partition key
and then check other properties of the clause like operator strictness,
valueexpr volatility, etc.  If those other properties are not satisfied,
we can "break", because with those properties they won't be useful for
pruning, even if it matched some other key.

> 19. I might have caused this, but there's no such variable as 'cur'
> 
> /* cur is more restrictive, so replace the existing. */

Fixed this and a few other instances.

> 20. Is there a difference between
> partition_bound_has_default(context->boundinfo) and
> context->has_default_part? Any reason for both? Your code uses both. I
> don't yet understand why has_default_part was added.

One cannot use partition_bound_has_default(context->boundinfo) within
partprune.c, because PartitionBoundInfo's definition is private to
partition.c. Maybe, if someday we expose it by exporting it (and related
partition bound manipulating functions) in, say, a partbound.h, we won't
need to have to resort to the same value being stored in two different
places like this.

> That's all I have for now.
> 
> It's getting close. Good work!
> 
> I keep having to turn up my strictness level each time I review.  I'd
> like to be able to turn it up all the way earlier, but I fear I may
> still be reviewing v1 if I'd done that :-)

It's really great that you've looked into these patches in such great
detail, suggesting various performance improvements (big and small) and
also pointing out various edge cases.  Thank you very much! :)

Attached updated version.

Regards,
Amit

Вложения

Re: [HACKERS] path toward faster partition pruning

От

Rajkumar Raghuwanshi

Дата:

20 февраля 2018 г., 13:56:11

On Tue, Feb 20, 2018 at 12:34 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Attached updated version.

Hi Amit,

I have applied v29 patch-set on head and got "ERROR: operator 1209 is not a member of opfamily 1994" with below test case. Please take a look.

CREATE TABLE part (c1 INT4, c2 TEXT, c3 INT4) PARTITION BY LIST (c2);
CREATE TABLE part_p1 PARTITION OF part FOR VALUES IN('ABC');
CREATE TABLE part_p2 PARTITION OF part FOR VALUES IN('DEF');
CREATE TABLE part_p3 PARTITION OF part FOR VALUES IN('GHI');
CREATE TABLE part_p4 PARTITION OF part FOR VALUES IN('JKL');

INSERT INTO part VALUES (100,'ABC',10);
INSERT INTO part VALUES (110,'DEF',20);
INSERT INTO part VALUES (120,'GHI',10);
INSERT INTO part VALUES (130,'JKL',100);

explain (costs off) SELECT * FROM part WHERE c2 LIKE '%ABC%';
ERROR: operator 1209 is not a member of opfamily 1994

Thanks & Regards,

Rajkumar Raghuwanshi

QMG, EnterpriseDB Corporation

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

20 февраля 2018 г., 14:21:22

Thanks Rajkumar for the test.

On 2018/02/20 16:56, Rajkumar Raghuwanshi wrote:
> I have applied v29 patch-set on head and  got "ERROR:  operator 1209 is not
> a member of opfamily 1994" with below test case. Please take a look.
> 
> CREATE TABLE part (c1 INT4, c2 TEXT, c3 INT4) PARTITION BY LIST (c2);
> CREATE TABLE part_p1 PARTITION OF part FOR VALUES IN('ABC');
> CREATE TABLE part_p2 PARTITION OF part FOR VALUES IN('DEF');
> CREATE TABLE part_p3 PARTITION OF part FOR VALUES IN('GHI');
> CREATE TABLE part_p4 PARTITION OF part FOR VALUES IN('JKL');
> 
> INSERT INTO part VALUES (100,'ABC',10);
> INSERT INTO part VALUES (110,'DEF',20);
> INSERT INTO part VALUES (120,'GHI',10);
> INSERT INTO part VALUES (130,'JKL',100);
> 
> explain (costs off) SELECT * FROM part WHERE c2 LIKE '%ABC%';
> *ERROR:  operator 1209 is not a member of opfamily 1994*

An oversight in the v28 patch seems to have caused this.  Fixed in the
attached.

Thanks,
Amit

Вложения

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

21 февраля 2018 г., 07:19:01

v30-0004-Faster-partition-pruning.patch contains:

+create table coll_pruning_multi (a text) partition by range
(substr(a, 1) collate "en_GB", substr(a, 1) collate "en_US");

This'll likely work okay on Linux. Other collate tests seem to use
COLLATE "POSIX or "C" so that work cross-platform.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

21 февраля 2018 г., 07:53:19

On 2018/02/21 10:19, David Rowley wrote:
> v30-0004-Faster-partition-pruning.patch contains:
> 
> +create table coll_pruning_multi (a text) partition by range
> (substr(a, 1) collate "en_GB", substr(a, 1) collate "en_US");
> 
> This'll likely work okay on Linux. Other collate tests seem to use
> COLLATE "POSIX or "C" so that work cross-platform.

Thanks.  I completely forgot about that.  I've rewritten those tests to
use "POSIX" and "C" in the attached.

Thanks,
Amit

Вложения

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

21 февраля 2018 г., 16:15:17

On 21 February 2018 at 14:53, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/02/21 10:19, David Rowley wrote:
>> v30-0004-Faster-partition-pruning.patch contains:
>>
>> +create table coll_pruning_multi (a text) partition by range
>> (substr(a, 1) collate "en_GB", substr(a, 1) collate "en_US");
>>
>> This'll likely work okay on Linux. Other collate tests seem to use
>> COLLATE "POSIX or "C" so that work cross-platform.
>
> Thanks.  I completely forgot about that.  I've rewritten those tests to
> use "POSIX" and "C" in the attached.

Thanks for fixing. I made a pass over v31 and only see a few small things:

1. In get_partitions_for_keys() why is the
get_partitions_excluded_by_ne_datums call not part of
get_partitions_for_keys_list?

2. Still a stray "minoff += 1;" in get_partitions_for_keys_range

3. You're also preferring to minoff--/++, but maxoff -= 1/maxoff += 1;
would be nice to see the style unified here.

4. "other other"

 * that is, each of its fields other other than clauseinfo must be valid before

5. "a IS NULL" -> "an IS NULL":

 * Based on a IS NULL or IS NOT NULL clause that was matched to a partition

6. Can you add a warning in the header comment for
extract_partition_clauses() to explain "Note: the 'clauses' List may
be modified inside this function. Callers may like to make a copy of
important lists before passing them to this function.", or something
like that...

7. "null" -> "nulls"

* Only allow strict operators.  This will guarantee null are

8. "dicard" -> "discard"

* contains a <= 2, then because 3 <= 2 is false, we dicard a < 3 as

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

21 февраля 2018 г., 16:44:08

Hi David.

Thanks for the review.

On 2018/02/21 19:15, David Rowley wrote:
> Thanks for fixing. I made a pass over v31 and only see a few small things:
> 
> 1. In get_partitions_for_keys() why is the
> get_partitions_excluded_by_ne_datums call not part of
> get_partitions_for_keys_list?

Hmm, there is a question of where exactly to put the call within
get_partitions_for_keys_list().  At the end would sound like an obvious
answer, but we tend to short-circuit return from that function at various
points, which it seems undesirable to change.  So, I left things as is here.

> 2. Still a stray "minoff += 1;" in get_partitions_for_keys_range

I actually found a few and changed them to ++ or --, as applicable.

> 
> 3. You're also preferring to minoff--/++, but maxoff -= 1/maxoff += 1;
> would be nice to see the style unified here.

Fixed all as mentioned above.

> 4. "other other"
> 
>  * that is, each of its fields other other than clauseinfo must be valid before

Fixed.

> 5. "a IS NULL" -> "an IS NULL":
> 
>  * Based on a IS NULL or IS NOT NULL clause that was matched to a partition

Fixed.

> 6. Can you add a warning in the header comment for
> extract_partition_clauses() to explain "Note: the 'clauses' List may
> be modified inside this function. Callers may like to make a copy of
> important lists before passing them to this function.", or something
> like that...

At least in my patch, extract_partition_clauses() is a local function with
just one caller, but I still don't see any problem with warning the
reader.  So, done.

> 7. "null" -> "nulls"
> 
> * Only allow strict operators.  This will guarantee null are
> 
> 8. "dicard" -> "discard"
> 
> * contains a <= 2, then because 3 <= 2 is false, we dicard a < 3 as

Fixed.

Please find attached updated patches.

Thanks,
Amit

On 2018/02/23 23:46, Robert Haas wrote:
> On Wed, Feb 21, 2018 at 5:44 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Please find attached updated patches.
> 
> Committed 0001 and 0002.

Thank you for committing and for the review.

> I'm having some difficulty wrapping my head around 0003 because it has
> minimal comments and no useful commit message.  I think, though, that
> it's actually broken.  Pre-patch, the logic in find_partition_scheme
> compares partopfamily, partopcintype, and parttypcoll and then asserts
> equality for parttyplen and parttypbyval; not coincidentally,
> PartitionSchemeData describes the latter two fields only as "cached
> data", so that the segregation of fields in PartitionSchemeData into
> two groups exactly matches what find_partition_scheme is actually
> doing.  However, with the patch, it turns into a sort of hodgepodge.
> parttypid is added into the "cached" section of PartitionSchemeData
> and partcollation to the primary section, but both values are
> compared, not asserted; parttypcoll moves from the "compared" section
> to the "asserted" section but the declaration in PartitionSchemeData
> stays where it was.
> 
> Moreover, there's no explanation of why this is getting changed.
> There's an existing comment that explains the motivation for what the
> code does today, which the patch does not modify:
> 
>   * We store the opclass-declared input data types instead of the partition key
>   * datatypes since the former rather than the latter are used to compare
>   * partition bounds. Since partition key data types and the opclass declared
>   * input data types are expected to be binary compatible (per ResolveOpClass),
>   * both of those should have same byval and length properties.
> 
> Obviously, this raises the issue of whether changing this is really
> the right thing to do in the first place, but at any rate it's
> certainly necessary for the comments to match what the code actually
> does.
> 
> Also, I find this not very helpful:
> 
> + * The collation of the partition key can differ from the collation of the
> + * underlying column, so we must store this separately.
> 
> If the comments about parttypcol and partcollation were clear enough
> (and I think they could use some work to distinguish them better),
> then this would be pretty much unnecessary -- clearly the only reason
> to store two things is if they might be different from each other.
> 
> It might be more useful to somehow explain how parttypid and
> partsupfunc are intended to be work/be used, but actually I don't
> think any satisfactory explanation is possible.  Either we have one
> partition scheme per partopcintype -- in which case parttypid is
> ill-defined because it could vary among relations with the same
> PartitionScheme -- or we have on per parttypid -- in which case,
> without some other change, partition-wise join will stop working
> between relations with different parttypids but the same
> partopcintype.

I think I'm convinced that partopcintype OIDs can be used where I thought
parttypid ones were necessary.  The pruning patch uses the respective OID
from this array when extracting the datum from an OpExpr to be compared
with the partition bound datums.  It's sensible, I now think, to require
the extracted datum to be of partition opclass declared input type, rather
than the type of the partition key involved.  So, I removed the parttypid
that I'd added to PartitionSchemeData.

Updated the comments to make clear the distinction between and purpose of
having both parttypcoll and partcollation.  Also expanded the comment
about partsupfunc a bit.

Attached updated patches.

Thanks,
Amit

Attached an updated version in which I incorporated some of the revisions
that David Rowley suggested to OR clauses handling (in partprune.c) that
he posted as a separate patch on the run-time pruning thread [1].

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAKJS1f8%2Bp-mXfFUiwR4xZ37STvgJeYF44yAjo5Rfxf92cFJyYQ%40mail.gmail.com

Вложения

Re: [HACKERS] path toward faster partition pruning

От

Robert Haas

Дата:

27 февраля 2018 г., 22:05:14

On Mon, Feb 26, 2018 at 10:59 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> You may say that partition bounds might have to be different too in this
> case and hence partition-wise join won't occur anyway, but I'm wondering
> if the mismatch of partcollation itself isn't enough to conclude that?

Yeah, you're right.  I think that this is just a bug in partition-wise
join, and that the partition scheme should just be using partcollation
instead of parttypcoll, as in the attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Вложения

partition-scheme-collation.patch

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

28 февраля 2018 г., 07:12:07

On 2018/02/28 1:05, Robert Haas wrote:
> On Mon, Feb 26, 2018 at 10:59 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> You may say that partition bounds might have to be different too in this
>> case and hence partition-wise join won't occur anyway, but I'm wondering
>> if the mismatch of partcollation itself isn't enough to conclude that?
> 
> Yeah, you're right.  I think that this is just a bug in partition-wise
> join, and that the partition scheme should just be using partcollation
> instead of parttypcoll, as in the attached.

Ah, OK. I was missing that there is no need to have both parttypcoll and
partcollation in PartitionSchemeData, as the Vars in rel->partexprs are
built from a bare PartitionKey (not PartitionSchemeData), and after that
point, parttypcoll no longer needs to kept around.

I noticed that there is a typo in the patch.

+    memcpy(part_scheme->partcollation, partkey->parttypcoll,

s/parttypcoll/partcollation/g

BTW, should there be a relevant test in partition_join.sql?  If yes,
attached a patch (partitionwise-join-collation-test-1.patch) to add one.

Also attached updated version of your patch (fixed the typo).

Thanks,
Amit

On 2018/03/01 2:23, Robert Haas wrote:
> On Tue, Feb 27, 2018 at 8:12 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Ah, OK. I was missing that there is no need to have both parttypcoll and
>> partcollation in PartitionSchemeData, as the Vars in rel->partexprs are
>> built from a bare PartitionKey (not PartitionSchemeData), and after that
>> point, parttypcoll no longer needs to kept around.
>>
>> I noticed that there is a typo in the patch.
>>
>> +       memcpy(part_scheme->partcollation, partkey->parttypcoll,
>>
>> s/parttypcoll/partcollation/g
> 
> Committed your version.

Thank you.

>> BTW, should there be a relevant test in partition_join.sql?  If yes,
>> attached a patch (partitionwise-join-collation-test-1.patch) to add one.
> 
> I don't feel strongly about it, but I'm not going to try to prevent
> you from adding one, either.

OK.  Attached is a revised version of that patch in case you consider
committing it, addressing Ashutosh's comment that the tables used in the
test should contain some data.

Thanks,
Amit

Вложения

partitionwise-join-collation-test-2.patch

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

01 марта 2018 г., 10:53:42

Thanks Ashutosh and David for reviewing.  Replying to both.

On 2018/02/28 20:25, Ashutosh Bapat wrote:
> On Tue, Feb 27, 2018 at 3:03 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached an updated version in which I incorporated some of the revisions
>> that David Rowley suggested to OR clauses handling (in partprune.c) that
>> he posted as a separate patch on the run-time pruning thread [1].
> 
> Some comments on 0001.
> 
>              partnatts != part_scheme->partnatts)
>              continue;
> 
> -        /* Match the partition key types. */
> +        /*
> +         * Match the partition key types and partitioning-specific collations.
> +         */
> 
> We are comparing opfamily and opclass input type as well, but this comment
> doesn't explicitly mention those like it mentions collation. Instead, I think
> we should just say, "Match partition key type properties"

Sounds good, done.

> You had commented on "advanced partition matching code" about asserting that
> parsupfuncs also match. Robert too has expressed similar opinion upthread. May
> be we should at least try to assert that the function OIDs match.

I guess you're referring to this email of mine:

https://www.postgresql.org/message-id/e681c283-5fc6-b1c6-1bb9-7102c37e2d55%40lab.ntt.co.jp

Note that I didn't say that we should Assert the equality of partsupfunc
members themselves, but rather whether we could add a comment explaining
why we don't.  Although, like you say, we could Assert the equality of
function OIDs, so added a block like this:

+  /*
+   * If partopfamily and partopcintype matched, must have the same
+   * partition comparison functions.  Note that we cannot reliably
+   * Assert the equality of function structs themselves for they might
+   * be different across PartitionKey's, so just Assert for the function
+   * OIDs.
+   */
+#ifdef USE_ASSERT_CHECKING
+   {
+      int     i;
+
+      for (i = 0; i < partkey->partnatts; i++)
+          Assert(partkey->partsupfunc[i].fn_oid ==
+                 part_scheme->partsupfunc[i].fn_oid);
+   }
+#endif

> -    Oid           *parttypcoll;    /* OIDs of collations of partition keys. */
> +
> +    /*
> +     * We store both the collation implied by the partition key's type and the
> +     * one specified for partitioning.  Values in the former are used as
> +     * varcollid in the Vars corresponding to simple column partition keys so
> +     * as to make them match corresponding Vars appearing elsewhere in the
> +     * query tree.  Whereas, the latter is used when actually comparing values
> +     * against partition bounds datums, such as, when doing partition pruning.
> +     */
> +    Oid           *parttypcoll;
> +    Oid           *partcollation;
> 
> As you have already mentioned upthread only partcollation is needed, not
> parttypcoll.

This hunk is gone after rebasing over 2af28e603319 (For partitionwise
join, match on partcollation, not parttypcoll) that was committed earlier
today.

>      /* Cached information about partition key data types. */
>      int16       *parttyplen;
>      bool       *parttypbyval;
> +
> +    /*
> +     * Cached array of partitioning comparison functions' fmgr structs.  We
> +     * don't compare these when trying to match two partition schemes.
> +     */
> 
> I think this comment should go away. The second sentence doesn't explain why
> and if it did so it should do that in find_partition_scheme() not here.
> partsupfunc is another property of partition keys that is cached like
> parttyplen, parttypbyval. Why does it deserve a separate comment when others
> don't?

Replaced that comment with:

+    /* Cached information about partition comparison functions. */

As mentioned above, I already added a comment in find_partition_scheme().

On 2018/02/28 20:35, David Rowley wrote:
> Micro review of v34:
>
> 1. Looks like you've renamed the parttypid parameter in the definition
> of partkey_datum_from_expr and partition_cmp_args, but not updated the
> declaration too.
>
> +static bool partkey_datum_from_expr(Oid parttypid, Expr *expr, Datum
*value);
>
> +static bool
> +partkey_datum_from_expr(Oid partopcintype, Expr *expr, Datum *value)
>
> +static bool partition_cmp_args(Oid parttypid, Oid partopfamily,
> +    PartClause *pc, PartClause *leftarg, PartClause *rightarg,
> +    bool *result);
>
> +static bool
> +partition_cmp_args(Oid partopcintype, Oid partopfamily,
> +    PartClause *pc, PartClause *leftarg, PartClause *rightarg,
> +    bool *result)

Oops, forgot about the declarations.  Fixed.

> 2. In prune_append_rel_partitions(), it's not exactly illegal, but int
> i is declared twice in different scopes. Looks like there's no need
> for the inner one.

Removed the inner one.

Attached updated patches.

Thanks,
Amit

On 2018/03/07 20:58, Amit Langote wrote:
> Hi.
> 
> On 2018/03/05 17:38, Amit Langote wrote:
>> I'll
>> post an update in a couple of days to report on how that works out.
> 
> I'm still working on this and getting most of the tests to pass with the
> new code, but not all of them yet.

Sorry about the delay.

Attached is a significantly revised version of the patch, although I admit
it could still use some work with regard to comments and other cleanup.

The rewrite introduces a notion of PartitionPruneStep nodes based on the
ideas described in [1].  So, instead of aggregating *all* of the pruning
clauses into a PartitionClauseInfo which was hard to serialize into a node
tree and then a PartScanKeyInfo (both of which no longer exist), this
generates a list of nodes.  Each node inherits from the base
PartitionPruneStep node type and contains information enough to perform
partition pruning by directly comparing the information with partition
bounds or contains sub-nodes that do.  For example, a PartitionPruneStepOp
step contains an integer telling the partitioning operator strategy (such
as various btree operator strategies) and a tuple to compare against
partition bounds stored in the relcache.  A PartitionPruneStepCombine step
contains arguments that are in turn pruning steps themselves, which are
separately executed and partition sets obtained thereby are combined using
the specified combineOp.

Also, fixed a bug of the previous design as detailed in [2].  So, with the
patch:

create table lparted (a smallint) partition by list (a);
create table lparted_1 partition of lparted for values in (1);
create table lparted_16384 partition of lparted for values in (16384);

-- all partitions pruned (lparted_16384 wouldn't be pruned by previous
-- patches due to comparison using bogus a partsupfunc)

explain (costs off) select * from lparted where a = 100000000000000;
        QUERY PLAN
--------------------------
 Result
   One-Time Filter: false
(2 rows)

Also,

create table rparted (a smallint) partition by range (a);
create table rparted_1 partition of rparted for values from (1) to (10);
create table rparted_16384 partition of rparted for values from (10) to
(16384);
create table rparted_maxvalue partition of rparted for values from (16384)
to (maxvalue);

-- all partitions except rparted_maxvalue pruned
explain (costs off) select * from rparted where a > 100000000000000;
                   QUERY PLAN
-------------------------------------------------
 Append
   ->  Seq Scan on rparted_maxvalue
         Filter: (a > '100000000000000'::bigint)
(3 rows)

I will continue working on improving the comments / cleaning things up and
post a revised version soon, but until then please look at the attached.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoahUxagjeNeJTcJkD0rbk%2BmHTXROzWcEd%2BtZ8DuQG83cg%40mail.gmail.com

[2]
https://www.postgresql.org/message-id/CA%2BTgmoYtKitwsFtA4%2B6cdeYGEfnS1%2BOY%2BG%3DUe26fgSzJZx%3DeJg%40mail.gmail.com

Hi David.

Thanks for the review.

On 2018/03/14 20:50, David Rowley wrote:
> On 14 March 2018 at 00:37, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached is a significantly revised version of the patch, although I admit
>> it could still use some work with regard to comments and other cleanup.
> 
> Thanks for making all those changes. There's been quite a bit of churn!
> 
> I've looked over the patch and agree that there need to be more
> comments to explain things. There were certainly times during my
> review that I just skipped ahead having not understood what I'd been
> looking at.

Hope the attached version is easier to understand.

> Here are the notes from my review:
> 
> 1. I think get_unpruned_partitions is not a great name for this
> function. get_matching_partitions would be better.

OK, I changed it to get_matching_partitions.

> 2. HASH partitioning can no longer prune non-matching partitions in
> cases like: SELECT * FROM hashp WHERE partkey1 IS NULL and partkey2 IS
> NULL; This might mean you need to process IS NULL and the key clauses
> in the same step.

OK, I made the PartitionPruneStepOp to contain a bitmap of null key
indexes.  It is referred to along with "values" for other columns only in
the hash partitioning case, whereas for list and range partitioning,
actual value(s) and null key information are passed via separate
PartitionPruneStepOp nodes, because in case of the latter we cannot mix
nulls and "values".

So, now there is no need for PartitionPruneStepNullness and
get_partitions_for_null_keys().

> I see you have a test which checks the plan for
> 
> explain (costs off) select * from hp where a = 1 and b is null;
> 
> which ensures all partitions are included. Why?

I had intended to remove support for hash partition pruning with some keys
being null that existed in the previous versions of the patch, but thought
it was too pessimistic after reading your comment.  I've added it back and
it works like it used to in the previous versions.

> 3. Header comment for get_partitions_for_keys references 'keys' which
> is not a parameter to that function.

Oops, fixed.

> 4. If you wrote a function to process an individual step and called
> that in a foreach loop inside get_unpruned_partitions, then you
> wouldn't need to list_make1() here. Instead just call the function to
> perform one step.
> 
> argparts =  get_unpruned_partitions(context,
> list_make1(step));

Hmm, yes.  I've introduced a perform_pruning_step() that switches on the
pruning step node type, that is called by either get_matching_partitions
while iterating the list it receives or by perform_pruning_combine_step()
calls on the arguments of a PartitionPruneStepCombine.

> 5. I see you're performing AND step combination in two different ways.
> In perform_pruning_combine_step you do;
> 
> andparts = andparts == NULL
> ? argparts
> : bms_int_members(andparts, argparts);
> 
> but in get_unpruned_partitions, you add the entire range to the set
> using bms_add_range, then intersect on that.
> 
> The code seems a bit fragile and relies on get_unpruned_partitions
> returning an allocated Bitmapset all the time, even if bms_is_empty()
> is true. There should be no distinction between NULL and a Bitmapset
> that returns true on bms_is_empty().
> 
> What you've got here probably works for now, but only due to the fact
> that get_unpruned_partitions allocate the entire range with
> bms_add_range and that bms_int_members does not return NULL with no
> matching members if both input sets are non-NULL. If that were to
> change then your code would misbehave in cases like:
> 
> WHERE <matches no partition> AND <matches a partition>;
> 
> When processing the <matches no partition> clause, no partitions would
> match, then if that resulted in an empty set you'd then surprisingly
> match partitions again despite the AND clause not actually making it
> possible for any partitions to have matched.
> 
> Probably you just need to bms_add_range in
> perform_pruning_combine_step too and perform bms_int_members
> unconditionally, just like you're doing in get_unpruned_partitions

I noticed a number of things I could improve about this, also considering
your points above.  Please check if the new structure is an improvement.

> 6. The switch (last->op_strategy) in
> generate_partition_pruning_steps_internal is missing a default ERROR
> for unknown strategy

I've fixed that.

> 7. The switch: switch (context->strategy) in
> generate_partition_pruning_steps_internal should ERROR rather than
> break when it sees an unknown partition strategy.

This one too.

> 
> 8. Instead of copying the opclause here, wouldn't you be better just
> this code come after you palloc the PartClause then just setup the
> PartClause with the negator directly?
> 
> if (*is_neop_listp)
> {
> Assert(OidIsValid(negator));
> opclause = copyObject(opclause);
> opclause->opno = negator;
> }

Agreed, done.

> 8. PartitionedChildRelInfo is still mentioned in typedefs.list

Removed.

> 9. I don't quite understand PartitionPruneStepNoop. Why can't you just
> skip adding anything to the list in
> generate_partition_pruning_steps_internal?
> 
> 10. The following test does not make sense:
> 
> explain (costs off) select * from hp where (a = 10 and b = 'yyy') or
> (a = 10 and b = 'xxx') or (a is null and b is null);
>                                                        QUERY PLAN
>
-------------------------------------------------------------------------------------------------------------------------
>  Append
>    ->  Seq Scan on hp0
>          Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b
> = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
>    ->  Seq Scan on hp1
>          Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b
> = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
>    ->  Seq Scan on hp2
>          Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b
> = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
>    ->  Seq Scan on hp3
>          Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b
> = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
> (9 rows)
> 
> Why do 4 partitions match when there are only 3 sets of clauses when
> each one can only match a single partition?

After bringing the support for hash partition pruning even with IS NULL
clauses back, this works as you'd expect.

> 11. What does "root" mean here?
> 
> -- case for list partitioned table that's not root
> explain (costs off) select * from rlp where a = 15 and b <> 'ab' and b
> <> 'cd' and b <> 'xy' and b is not null;

It means a partitioned table that is not the root partitioned table.
Those <> clauses prune but not at the root level, only after recursing for
a list partitioned child of rlp.

Attached updated patches.

Thanks,
Amit

Вложения

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

19 марта 2018 г., 13:18:48

On 17 March 2018 at 01:55, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Hope the attached version is easier to understand.

Hi Amit,

Thanks for making the updates. I'll look at them soon.

I've been thinking about how we're making these improvements for
SELECT only.  If planning for an UPDATE or DELETE of a partitioned
table then since the inheritance planner is planning each partition
individually we gain next to nothing from this patch.

Generally, it seems the aim of this patch is to improve the usability
of partitions in an OLTP type workload, most likely OLAP does not
matter as much since planner overhead, in that case, is generally less
of a concern.

I experimented with the attached small patch to see if the situation
could be improved if we first plan the entire query with all
partitions then ignore dummy rels when planning for each individual
partition.

I used something along the lines of:

# create table listp (a int, b int) partition by list(a);
# select 'create table listp'||x||' partition of listp for values
in('||x||');' from generate_series(1, <number of tables>)x;
$ echo explain update listp set b = 1 where a = 1; > bench.sql
$ pgbench -f bench.sql -n -T 30 postgres

where <number of tables> started at 1 and went up in powers of 2 until 1024.

Unpatched = your v35 patch
Patched = your v35 + the attached.

The TPS result from a 30-second pgbench run of the above query showed:

Partitions = 1
Unpatched: 7323.3
Patched: 6573.2 (-10.24%)

Partitions = 2
Unpatched: 6784.8
Patched: 6377.1 (-6.01%)

Partitions = 4
Unpatched: 5903.0
Patched: 6106.8 (3.45%)

Partitions = 8
Unpatched: 4582.0
Patched: 5579.9 (21.78%)

Partitions = 16
Unpatched: 3131.5
Patched: 4521.2 (44.38%)

Partitions = 32
Unpatched: 1779.8
Patched: 3387.8 (90.35%)

Partitions = 64
Unpatched: 821.9
Patched: 2245.4 (173.18%)

Partitions = 128
Unpatched: 322.2
Patched: 1319.6 (309.56%)

Partitions = 256
Unpatched: 84.3
Patched: 731.7 (768.27%)

Partitions = 512
Unpatched: 22.5
Patched: 382.8 (1597.74%)

Partitions = 1024
Unpatched: 5.5
Patched: 150.1 (2607.83%)

Which puts the crossover point at just 4 partitions, and just a small
overhead for 1, 2 and probably 3 partitions. The planner generated a
plan 26 times faster (!) with 1024 partitions.

Likely there's more than could be squeezed out of this if we could get
the grouping_planner() to somehow skip creating paths and performing
the join search. But that patch is not nearly as simple as the
attached.

Probably grouping_planner could also be called with inheritance_update
= false, for this one case too, which might save a small amount of
effort.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

improve_performance_of_inheritance_planner.patch

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

19 марта 2018 г., 16:03:35

Hi David.

On 2018/03/19 16:18, David Rowley wrote:
> On 17 March 2018 at 01:55, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Hope the attached version is easier to understand.
> 
> Hi Amit,
> 
> Thanks for making the updates. I'll look at them soon.
> 
> I've been thinking about how we're making these improvements for
> SELECT only.  If planning for an UPDATE or DELETE of a partitioned
> table then since the inheritance planner is planning each partition
> individually we gain next to nothing from this patch.

Unfortunately, yes. :-(

Just recently, I replied to a pgsql-bugs report by someone who had OOM
kill a backend running `delete from
partitioned_table_with_7202_partitions` on their test system [1].  That'd
be because running inheritance_planner on a partitioned table doesn't cope
very well beyond a few hundred partitions, as we've also written in our
partitioning/inheritance documentation.

> Generally, it seems the aim of this patch is to improve the usability
> of partitions in an OLTP type workload, most likely OLAP does not
> matter as much since planner overhead, in that case, is generally less
> of a concern.

Yes, that makes sense.

> I experimented with the attached small patch to see if the situation
> could be improved if we first plan the entire query with all
> partitions then ignore dummy rels when planning for each individual
> partition.
> 
> I used something along the lines of:
> 
> # create table listp (a int, b int) partition by list(a);
> # select 'create table listp'||x||' partition of listp for values
> in('||x||');' from generate_series(1, <number of tables>)x;
> $ echo explain update listp set b = 1 where a = 1; > bench.sql
> $ pgbench -f bench.sql -n -T 30 postgres
> 
> where <number of tables> started at 1 and went up in powers of 2 until 1024.
> 
> Unpatched = your v35 patch
> Patched = your v35 + the attached.
> 
> The TPS result from a 30-second pgbench run of the above query showed:
> 
> Partitions = 1
> Unpatched: 7323.3
> Patched: 6573.2 (-10.24%)
> 
> Partitions = 2
> Unpatched: 6784.8
> Patched: 6377.1 (-6.01%)
> 
> Partitions = 4
> Unpatched: 5903.0
> Patched: 6106.8 (3.45%)
> 
> Partitions = 8
> Unpatched: 4582.0
> Patched: 5579.9 (21.78%)
> 
> Partitions = 16
> Unpatched: 3131.5
> Patched: 4521.2 (44.38%)
> 
> Partitions = 32
> Unpatched: 1779.8
> Patched: 3387.8 (90.35%)
> 
> Partitions = 64
> Unpatched: 821.9
> Patched: 2245.4 (173.18%)
> 
> Partitions = 128
> Unpatched: 322.2
> Patched: 1319.6 (309.56%)
> 
> Partitions = 256
> Unpatched: 84.3
> Patched: 731.7 (768.27%)
> 
> Partitions = 512
> Unpatched: 22.5
> Patched: 382.8 (1597.74%)
> 
> Partitions = 1024
> Unpatched: 5.5
> Patched: 150.1 (2607.83%)
>
> Which puts the crossover point at just 4 partitions, and just a small
> overhead for 1, 2 and probably 3 partitions. The planner generated a
> plan 26 times faster (!) with 1024 partitions.

Nice!

> Likely there's more than could be squeezed out of this if we could get
> the grouping_planner() to somehow skip creating paths and performing
> the join search. But that patch is not nearly as simple as the
> attached.

Yeah, that'd be nice.  Do you think that we cannot fix update/delete on
partitioned tables until we have such a patch though?  IOW, did you intend
the patch you posted to just be a PoC to demonstrate that we can save tons
just by not doing grouping_planner() on pruned partitions?

BTW, maybe you know, but if we want this to prune same partitions as are
pruned during select (due to the new pruning facility), we'd need to teach
get_relation_constraints() to not fetch the partition constraint
(RelationGetPartitionQual) at all.  My patch currently teaches it to avoid
fetching the partition constraint only for select.  If we include the
partition constraint in the list of constraints returned by
get_relation_constraints, we'd still be redundantly executing the
constraint exclusion logic for the selected partitions via the
grouping_planner() call on those partitions.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/fecdef72-8c2a-0794-8e0a-2ad76db82c68@lab.ntt.co.jp

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

19 марта 2018 г., 16:35:43

On 19 March 2018 at 23:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Just recently, I replied to a pgsql-bugs report by someone who had OOM
> kill a backend running `delete from
> partitioned_table_with_7202_partitions` on their test system [1].  That'd
> be because running inheritance_planner on a partitioned table doesn't cope
> very well beyond a few hundred partitions, as we've also written in our
> partitioning/inheritance documentation.

hmm, yeah that's unfortunate. I'd not done much study of the
inheritance planner before, but I see how that could happen now that I
understand a bit more about it.  nparts^2 RelOptInfos will be created
on such problems. My patch should help with that providing that some
pruning will actually take place, but make the problem very slightly
worse if none can be pruned.

> On 2018/03/19 16:18, David Rowley wrote:
>> Likely there's more than could be squeezed out of this if we could get
>> the grouping_planner() to somehow skip creating paths and performing
>> the join search. But that patch is not nearly as simple as the
>> attached.
>
> Yeah, that'd be nice.  Do you think that we cannot fix update/delete on
> partitioned tables until we have such a patch though?  IOW, did you intend
> the patch you posted to just be a PoC to demonstrate that we can save tons
> just by not doing grouping_planner() on pruned partitions?

The patch was meant as a PoC. I think the performance of the patch is
acceptable without any additional optimisation work. It would be nice,
but any more code that's added would need more reviewer and committer
time, both of which are finite, especially so before PG11 code cut
off.

I think it would be a shame to tell people partition is usable now for
a decent number of partitions, providing you don't need to perform any
OLTP UPDATE/DELETE operations on the partitions. I think for the few
lines of code that the proposed patch takes it's worth considering for
PG11, but only once your work has gone in. I certainly wouldn't want
this to hold your work back.

> BTW, maybe you know, but if we want this to prune same partitions as are
> pruned during select (due to the new pruning facility), we'd need to teach
> get_relation_constraints() to not fetch the partition constraint
> (RelationGetPartitionQual) at all.  My patch currently teaches it to avoid
> fetching the partition constraint only for select.  If we include the
> partition constraint in the list of constraints returned by
> get_relation_constraints, we'd still be redundantly executing the
> constraint exclusion logic for the selected partitions via the
> grouping_planner() call on those partitions.

I'd not thought of that. It seems more like a performance optimisation
than something that's required for correctness.  Removing that would
probably make constraint_exclusion = 'partition' pretty useless

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

20 марта 2018 г., 17:07:06

On 2018/03/16 21:55, Amit Langote wrote:
> Attached updated patches.

Attached is further revised version.

Of note is getting rid of PartitionPruneContext usage in the static
functions of partprune.c.  Most of the code there ought to only run during
planning, so it can access the necessary information from RelOptInfo
directly instead of copying it to PartitionPruneContext and then passing
it around.

Thanks,
Amit

On 21 March 2018 at 00:07, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached is further revised version.

Hi Amit,

Thanks for sending the v38 patch.

I started reviewing this, but I just ended up starting to hack at the
patch instead. It's still got quite a bit of work to be done as I
think, unfortunately,  the cross type stuff is still pretty broken.
There's not really any sort of checking done to make sure you've found
a valid cross type hash or compare function in the code which results
in errors like:

create table hashp (a int, b numeric) partition by hash(a,b);
create table hashp1 partition of hashp for values with (modulus 4, remainder 0);
create table hashp2 partition of hashp for values with (modulus 4, remainder 1);
create table hashp3 partition of hashp for values with (modulus 4, remainder 2);
create table hashp4 partition of hashp for values with (modulus 4, remainder 3);
explain select * from hashp where a = 3::smallint and b = 1.0;
ERROR:  cache lookup failed for function 0

I'm not really sure if this should be a matter of doing an if
(!OidIsValid(new_supfuncid)) return false; I think the
context->partsupfunc must be pretty broken in cases like:

create table listp (a bigint) partition by list(a);
create table listp1 partition of listp for values in(1);
select * from listp where a <> 1::smallint and a <> 1::bigint;

The current patch simply just remembers the last comparison function
for comparing int8 to int4 and uses that one for the int8 to int2
comparison too.

Probably we need to cache the comparison function's Oid in with the
Expr in the step and use the correct one each time. I'm unsure of how
the fmgr info should be cached, but looks like it certainly cannot be
cached in the context in an array per partition key. I've so far only
thought some sort of hash table, but I'm sure there must be a much
better way to do this.

I started hacking it partition.c and ended up changing quite a few
things. I changed get_partitions_for_keys into 3 separate functions,
one for hash, list and range and tidied a few things up in that area.
There were a few bugs, for example passing the wrong value for the
size of the array into get_partitions_excluded_by_ne_datums.

I also changed how the Bitmapsets are handled in the step functions
and got rid of the Noop step type completely. I also got rid of the
passing of the srcparts into these functions. I think Roberts idea is
to process the steps in isolation and just combine the partitions
matching each step.

It would be great if we could coordinate our efforts here. I'm posting
this patch now just in case you're working or about to work on this.

In the meantime, I'll continue to drip feed cleanup patches. I'll try
to start writing some comments too, once I figure a few things out...

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

v38_drowley_delta1.patch

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

23 марта 2018 г., 18:15:47

Hi David.

On 2018/03/23 16:38, David Rowley wrote:
> On 21 March 2018 at 00:07, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached is further revised version.
> 
> Hi Amit,
> 
> Thanks for sending the v38 patch.
> 
> I started reviewing this, but I just ended up starting to hack at the
> patch instead. It's still got quite a bit of work to be done as I
> think, unfortunately,  the cross type stuff is still pretty broken.
> There's not really any sort of checking done to make sure you've found
> a valid cross type hash or compare function in the code which results
> in errors like:
> 
> create table hashp (a int, b numeric) partition by hash(a,b);
> create table hashp1 partition of hashp for values with (modulus 4, remainder 0);
> create table hashp2 partition of hashp for values with (modulus 4, remainder 1);
> create table hashp3 partition of hashp for values with (modulus 4, remainder 2);
> create table hashp4 partition of hashp for values with (modulus 4, remainder 3);
> explain select * from hashp where a = 3::smallint and b = 1.0;
> ERROR:  cache lookup failed for function 0

Hmm yes.  I had realized that while addressing Robert's related comment.

> I'm not really sure if this should be a matter of doing an if
> (!OidIsValid(new_supfuncid)) return false; I think the
> context->partsupfunc must be pretty broken in cases like:
> 
> create table listp (a bigint) partition by list(a);
> create table listp1 partition of listp for values in(1);
> select * from listp where a <> 1::smallint and a <> 1::bigint;
> 
> The current patch simply just remembers the last comparison function
> for comparing int8 to int4 and uses that one for the int8 to int2
> comparison too.
> 
> Probably we need to cache the comparison function's Oid in with the
> Expr in the step and use the correct one each time. I'm unsure of how
> the fmgr info should be cached, but looks like it certainly cannot be
> cached in the context in an array per partition key. I've so far only
> thought some sort of hash table, but I'm sure there must be a much
> better way to do this.

Yeah, I realized that simply replacing the context->partsupfunc member is
not a solution.

In the updated patch (that is, after incorporating your changes), I have
moved this partsupfunc switching to the caller of partkey_datum_from_expr
instead of doing it there.  New patch also checks that returned function
OID is valid, which if not we don't use the expression's value for pruning.

So now. we statically allocate a partsupfunc array on every invocation of
perform_pruning_base_step() or of get_partitions_excluded_by_ne_datums().
Considering run-time pruning, we may have to find some other place to
cache that.

> I started hacking it partition.c and ended up changing quite a few
> things. I changed get_partitions_for_keys into 3 separate functions,
> one for hash, list and range and tidied a few things up in that area.
> There were a few bugs, for example passing the wrong value for the
> size of the array into get_partitions_excluded_by_ne_datums.
> 
> I also changed how the Bitmapsets are handled in the step functions
> and got rid of the Noop step type completely. I also got rid of the
> passing of the srcparts into these functions. I think Roberts idea is
> to process the steps in isolation and just combine the partitions
> matching each step.>
> It would be great if we could coordinate our efforts here. I'm posting
> this patch now just in case you're working or about to work on this.

Thanks a lot for making all those changes and sharing the patch.  I've
incorporated in the attached latest version.

> In the meantime, I'll continue to drip feed cleanup patches. I'll try
> to start writing some comments too, once I figure a few things out...

Here is the updated version.

I'm still thinking about what to do about avoiding recursion when
performing combine steps [1] as Robert mentioned in his email.

Thanks,
Amit

Вложения

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

23 марта 2018 г., 18:19:27

Thanks for the review.

On 2018/03/21 6:29, Robert Haas wrote:
> On Tue, Mar 20, 2018 at 7:07 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2018/03/16 21:55, Amit Langote wrote:
>>> Attached updated patches.
>>
>> Attached is further revised version.
>>
>> Of note is getting rid of PartitionPruneContext usage in the static
>> functions of partprune.c.  Most of the code there ought to only run during
>> planning, so it can access the necessary information from RelOptInfo
>> directly instead of copying it to PartitionPruneContext and then passing
>> it around.
> 
> +    if (rte->relkind == RELKIND_PARTITIONED_TABLE)
> +    {
> +        if (rel->baserestrictinfo != NIL)
> +        {
> +            live_children = prune_append_rel_partitions(root, rel);
> +            did_pruning = true;
> +        }
> +    }
> 
> Use &&

Fixed in the latest version.

> 
> +        case COMBINE_OR:
> +        {
> 
> Won't survive pgindent, which currently produces a *massive* diff for
> these patches.

That's gone in the latest patch.

Things for the overall patch should have improved in the latest version.

> +            /*
> +             * XXX- The following ad-hoc method of pruning only works for list
> +             * partitioning.  It checks for each partition if all of its
> +             * accepted values appear in ne_datums[].
> +             */
> 
> So why are we doing it this way?  How about doing something not
> ad-hoc?  I tried to propose that before.

Hmm, perhaps I should have written a better comment.

What I really meant to say is that pruning using <> operators can be
implemented sanely only for list partitions.  We can prune a given
partition with <> clauses, only if we can find a <> clause for *all*
values that the partition accepts.  Doing so seems doable only for list
partitions where we require enumerating all values that the partition may
contain.

> + *      Set *value to the constant value obtained by evaluating 'expr'
> + *
> + * Note that we may not be able to evaluate the input expression, in which
> + * case, the function returns false to indicate that *value has not been
> + * set.  True is returned otherwise.
> 
> These comments need updating, since this function (laudibly) no longer
> does any evaluating.  I wonder how this will work for run-time
> pruning, though.

Fixed the comment.

Run-time pruning patch adds some code to this function to be able to
return values for Params for which, afaik, it also adds some state
information to the PartitionPruneContext argument.

> +    if (context->partopcintype[partkeyidx] != exprTyp)
> +    {
> +        Oid     new_supfuncid;
> +        int16   procnum;
> +
> +
> +        procnum = (context->strategy == PARTITION_STRATEGY_HASH)
> +                        ? HASHEXTENDED_PROC
> +                        : BTORDER_PROC;
> +        new_supfuncid = get_opfamily_proc(context->partopfamily[partkeyidx],
> +                                          context->partopcintype[partkeyidx],
> +                                          exprTyp, procnum);
> +        fmgr_info(new_supfuncid, &context->partsupfunc[partkeyidx]);
> +    }
> 
> What's the point of this, exactly?  Leftover dead code, maybe?

Actually, this *was* an effort to teach the patch to use the correct
comparison function for comparison against partition bounds in case of
clause value being of different type.

After reading David's comment about this, I concluded that it's placed at
a wrong place, which is fixed in the latest patch.   The comparison
functions are changed (if needed) in the function that would call
partkey_datum_from_expr, not in partkey_datum_from_expr itself.

> + * Input:
> + *  See the comments above the definition of PartScanKeyInfo to see what
> + *  kind of information is contained in 'keys'.
> 
> There's no such thing as PartScanKeyInfo any more and the function has
> no argument called 'keys'.  None of the functions actual arguments are
> explained.

Sorry, this should be gone in the latest patches.

> +    /*
> +     * If there are multiple pruning steps, we perform them one after another,
> +     * passing the result of one step as input to another.  Based on the type
> +     * of pruning step, perform_pruning_step may add or remove partitions from
> +     * the set of partitions it receives as the input.
> +     */
> 
> The comment sounds great, but the code doesn't work that way; it
> always calls bms_int_members to intersect the new result with any
> previous result.  I'm baffled as to how this manages to DTRT if
> COMBINE_OR is used.  In general I had hoped that the list of pruning
> steps was something over which we were only going to iterate, not
> recurse.  This definitely recurses for the combine steps, but it's
> still (sorta) got the idea of a list of iterable steps.  That's a
> weird mix.

At the top-level (in get_matching_partitions), it is assumed that the
steps in the input list come from implicitly AND'd clauses, so the
intersection between partition sets that we get for each.

Anyway, after David's rewrite of this portion of the patch incorporated in
the latest patch, things look a bit different here, although there is
still recursion for combine steps.  I'm still considering how to make the
recursion go away.

> +            if (nvalues == context->partnatts)
> +            {
> +                greatest_modulus = get_greatest_modulus(boundinfo);
> +                rowHash = compute_hash_value(partnatts, partsupfunc, values,
> +                                             isnull);
> +                result_index = partindices[rowHash % greatest_modulus];
> +                if (result_index >= 0)
> +                    return bms_make_singleton(result_index);
> +            }
> +            else
> +                /* Can't do pruning otherwise, so return all partitions. */
> +                return bms_add_range(NULL, 0, context->nparts - 1);
> 
> Wouldn't we want to (1) arrange things so that this function is never
> called if nvalues < context->partnatts && context->strategy ==
> PARTITION_STRATEGY_HASH or at least (2) avoid constructing isnull from
> nullkeys if we're not going to use it?

We call this function even if nvalues < context->partnatts but we found IS
NULL clauses for *all* the remaining columns.

Although, given the checks that planner (partprune.c) performs, we should
get here only if pruning is possible, so the code to handle cases where
pruning couldn't occur is redundant.

> Also, shouldn't we be sanity-checking the strategy number here?

That's right, fixed.

> 
> I'm out of time for right now but it looks to me like this patch still
> needs quite a bit of fine-tuning.

I have posted an updated patch in reply to David's review.

Thanks,
Amit

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

24 марта 2018 г., 09:42:06

On 24 March 2018 at 01:15, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> In the updated patch (that is, after incorporating your changes), I have
> moved this partsupfunc switching to the caller of partkey_datum_from_expr
> instead of doing it there.  New patch also checks that returned function
> OID is valid, which if not we don't use the expression's value for pruning.

Thanks for accepting those changes.

> So now. we statically allocate a partsupfunc array on every invocation of
> perform_pruning_base_step() or of get_partitions_excluded_by_ne_datums().
> Considering run-time pruning, we may have to find some other place to
> cache that.

hmm yeah, it's not perfect, but I don't have any better ideas for now,
apart from this probably could be done when creating the steps rather
than executing them. That would save having to look up the correct
function Oid during execution, and save bothering to create steps
values that we simply can't compare to the partition key.

I've done this in the attached patch against v39.

I also renamed argvalues to argexprs, since they're not values.  The
PartClauseInfo could probably do with the same change too, but I
didn't touch it.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

v39_drowley_delta1.patch

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

25 марта 2018 г., 11:28:54

On 24 March 2018 at 16:42, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've done this in the attached patch against v39.
>
> I also renamed argvalues to argexprs, since they're not values.  The
> PartClauseInfo could probably do with the same change too, but I
> didn't touch it.

The attached goes a little further and does a bit more renaming. I
don't think "values" is a good name for a list of Exprs. I'd expect
that might be a better-suited name for an array of Datums.

I've also added and modified a few comments.  More comments are still
required. The Step structs are mostly undocumented still, but I'm
trying to understand how all this fits together still, at least well
enough to write about it.

The attached delta applies on top of v39 plus delta1.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

v39_drowley_delta2.patch

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

25 марта 2018 г., 11:32:12

On 25 March 2018 at 18:28, David Rowley <david.rowley@2ndquadrant.com> wrote:
> The attached delta applies on top of v39 plus delta1.

Sorry, the attached should do this. Ignore the last attachment.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

v39_drowley_delta2_1.patch

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

26 марта 2018 г., 17:42:57

Hi David.

On 2018/03/25 14:32, David Rowley wrote:
> On 25 March 2018 at 18:28, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> The attached delta applies on top of v39 plus delta1.
> 
> Sorry, the attached should do this. Ignore the last attachment.

I have incorporated both of your delta1 and delta2_1 patches.

Your proposed change to determine the cross-type comparison function OID
during planning itself is a good one, although I wasn't sure why it was
done only for the <> operators.  I also implemented that for
PartitionPruneStepOp steps.

Also, I started thinking that implementing pruning using <> operators with
a PartitionPruneCombineOp was not such a great idea.  That needed us to
add argexprs and argcmpfns to that struct, which seemed a bit odd.  I
defined a new pruning node type called PartitionPruneStepOpNe, which still
seems a bit odd, but given that our support for pruning using <> is quite
specialized, that may be fine.

I added a bunch of hopefully informative comments in partprune.c and for
the struct definitions of pruning step nodes.

Please find attached find a new version.

Thanks,
Amit

Вложения

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

27 марта 2018 г., 16:42:42

On 2018/03/23 21:19, Amit Langote wrote:
> On 2018/03/21 6:29, Robert Haas wrote:
>> +    /*
>> +     * If there are multiple pruning steps, we perform them one after another,
>> +     * passing the result of one step as input to another.  Based on the type
>> +     * of pruning step, perform_pruning_step may add or remove partitions from
>> +     * the set of partitions it receives as the input.
>> +     */
>>
>> The comment sounds great, but the code doesn't work that way; it
>> always calls bms_int_members to intersect the new result with any
>> previous result.  I'm baffled as to how this manages to DTRT if
>> COMBINE_OR is used.  In general I had hoped that the list of pruning
>> steps was something over which we were only going to iterate, not
>> recurse.  This definitely recurses for the combine steps, but it's
>> still (sorta) got the idea of a list of iterable steps.  That's a
>> weird mix.
> 
> At the top-level (in get_matching_partitions), it is assumed that the
> steps in the input list come from implicitly AND'd clauses, so the
> intersection between partition sets that we get for each.
> 
> Anyway, after David's rewrite of this portion of the patch incorporated in
> the latest patch, things look a bit different here, although there is
> still recursion for combine steps.  I'm still considering how to make the
> recursion go away.

I have managed to make the recursion go away in the attached updated
version.  I guess that's the result of employing the idea of a "output
register" for individual pruning steps as mentioned in Robert's email
upthread where he detailed the "pruning steps" approach [1].

With the new patch, pruning steps for arguments of, say, an OR clause are
not performed recursively.  Instead, each pruning step is performed
independently and its output is stored in a slot dedicated to it.  Combine
steps are always executed after all of the steps corresponding to its
arguments have been executed.  That's ensured by the way steps are allocated.

Jesper, off-list, reported an unused variable which has been removed in
the updated patch.  Thanks Jesper!  He also pointed out a case with a
list-partitioned table where pruning doesn't a produce a result as one
would expect and what constraint exclusion would produce.

create table lp (a char) partition by list (a);
create table lp_ad partition of lp for values in ('a', 'd');
create table lp_bc partition of lp for values in ('b', 'c');
create table lp_default partition of lp default;
explain (costs off) select * from lp where a > 'a' and a < 'd';
                        QUERY PLAN
-----------------------------------------------------------
 Append
   ->  Seq Scan on lp_ad
         Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
   ->  Seq Scan on lp_bc
         Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
   ->  Seq Scan on lp_default
         Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
(7 rows)

One would expect that lp_ad is not scanned.

With the implementation where a > 'a' and a < 'd' are used to prune
separately, this cannot be avoided given the way
get_partitions_for_keys_list() added by the patch works.  What happens is
we prune first with a step generated for a > 'a', which returns partitions
for all datums in the table's boundinfo greater than 'a' that have a
partition assigned, which means we'll include the partition that accepts
'd'.  Then when pruning with a < 'd', we select partitions for all datums
less than 'd' that have a partition assigned, which means we end up
including the partition that accepts 'a'.  We intersect the result of
running these independent steps, but lp_ad is present in the result sets
of both the sets, so it ends up in the final result.  Maybe there is a way
to fix that, but I haven't done anything about it yet.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoahUxagjeNeJTcJkD0rbk%2BmHTXROzWcEd%2BtZ8DuQG83cg%40mail.gmail.com

Вложения

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

27 марта 2018 г., 16:51:56

On 27 March 2018 at 00:42, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Also, I started thinking that implementing pruning using <> operators with
> a PartitionPruneCombineOp was not such a great idea.  That needed us to
> add argexprs and argcmpfns to that struct, which seemed a bit odd.  I
> defined a new pruning node type called PartitionPruneStepOpNe, which still
> seems a bit odd, but given that our support for pruning using <> is quite
> specialized, that may be fine.

Seems better

> I added a bunch of hopefully informative comments in partprune.c and for
> the struct definitions of pruning step nodes.

Yes. That looks better.

> Please find attached find a new version.

Thanks. I've made a pass over this and I only have the attached set of
fixes and the following to show for it.

1. Please add more comments in the switch statement in
get_partitions_for_keys_range

2. More an observation than anything else. I see we've lost the
ability to prune range queries on LIST partitions in some cases.

For example:

CREATE TABLE listp (a INT) PARTITION BY LIST(a);
CREATE TABLE listp1_3 PARTITION OF listp FOR VALUES IN(1,3);

EXPLAIN SELECT * FROM listp WHERE a > 1 AND a < 3;

This is just down to the new pruning step design. WHERE we first prune
on "a > 1", which matches listp1_3 due to 3, then binary-AND to the
results of the "a < 3", which matches listp1_3 due to 1. This is a
shame, but probably not the end of the world. Fixing it would likely
mean moving back towards the previous design.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

v40_drowley_fixes.patch

Re: [HACKERS] path toward faster partition pruning

От

Jesper Pedersen

Дата:

27 марта 2018 г., 23:46:49

Hi Amit,

On 03/27/2018 06:42 AM, Amit Langote wrote:
> I have managed to make the recursion go away in the attached updated
> version.  I guess that's the result of employing the idea of a "output
> register" for individual pruning steps as mentioned in Robert's email
> upthread where he detailed the "pruning steps" approach [1].
> 
> With the new patch, pruning steps for arguments of, say, an OR clause are
> not performed recursively.  Instead, each pruning step is performed
> independently and its output is stored in a slot dedicated to it.  Combine
> steps are always executed after all of the steps corresponding to its
> arguments have been executed.  That's ensured by the way steps are allocated.
> 

Running v41 with "partition_prune" under valgrind gives the attached report.

Best regards,
  Jesper

Вложения

25970.log.gz

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

27 марта 2018 г., 23:54:35

Amit Langote wrote:

> [Jesper] also pointed out a case with a
> list-partitioned table where pruning doesn't a produce a result as one
> would expect and what constraint exclusion would produce.
> 
> create table lp (a char) partition by list (a);
> create table lp_ad partition of lp for values in ('a', 'd');
> create table lp_bc partition of lp for values in ('b', 'c');
> create table lp_default partition of lp default;
> explain (costs off) select * from lp where a > 'a' and a < 'd';
>                         QUERY PLAN
> -----------------------------------------------------------
>  Append
>    ->  Seq Scan on lp_ad
>          Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
>    ->  Seq Scan on lp_bc
>          Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
>    ->  Seq Scan on lp_default
>          Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
> (7 rows)
> 
> One would expect that lp_ad is not scanned.

One would?  I, for one, wouldn't particularly sweat over this case TBH.
It seems a pretty silly case.  If this works for "a <> 'a' and a <> 'd'"
(I mean, lp_ad is pruned for that qual), that sounds sufficient to me.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Tom Lane

Дата:

28 марта 2018 г., 00:12:38

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Amit Langote wrote:
>> [Jesper] also pointed out a case with a
>> list-partitioned table where pruning doesn't a produce a result as one
>> would expect and what constraint exclusion would produce.
>> 
>> create table lp (a char) partition by list (a);
>> create table lp_ad partition of lp for values in ('a', 'd');
>> create table lp_bc partition of lp for values in ('b', 'c');
>> create table lp_default partition of lp default;
>> explain (costs off) select * from lp where a > 'a' and a < 'd';
>> QUERY PLAN
>> -----------------------------------------------------------
>> Append
>> ->  Seq Scan on lp_ad
>> Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
>> ->  Seq Scan on lp_bc
>> Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
>> ->  Seq Scan on lp_default
>> Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
>> (7 rows)
>> 
>> One would expect that lp_ad is not scanned.

> One would?  I, for one, wouldn't particularly sweat over this case TBH.

That example works in HEAD, so if somebody is proposing a patch that
breaks it, seems like that needs investigation.

            regards, tom lane

Re: [HACKERS] path toward faster partition pruning

От

Jesper Pedersen

Дата:

28 марта 2018 г., 00:52:19

Hi,

On 03/27/2018 01:46 PM, Jesper Pedersen wrote:
> Running v41 with "partition_prune" under valgrind gives the attached 
> report.
> 

The reports mostly involve interaction with catcache.c and dynahash.c, 
so something for a separate thread.

Best regards,
  Jesper

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

28 марта 2018 г., 09:58:08

On 27 March 2018 at 23:42, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I have managed to make the recursion go away in the attached updated
> version.  I guess that's the result of employing the idea of a "output
> register" for individual pruning steps as mentioned in Robert's email
> upthread where he detailed the "pruning steps" approach [1].

Thanks for making that work. I've only glanced at the patch, and not
taken enough time to understand how the new parts work yet.

In the meantime, I've attached some fixes for v41 which I previously
submitted for v40.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

v41_drowley_fixes.patch

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

28 марта 2018 г., 15:29:26

On 2018/03/28 12:58, David Rowley wrote:
> On 27 March 2018 at 23:42, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I have managed to make the recursion go away in the attached updated
>> version.  I guess that's the result of employing the idea of a "output
>> register" for individual pruning steps as mentioned in Robert's email
>> upthread where he detailed the "pruning steps" approach [1].
> 
> Thanks for making that work. I've only glanced at the patch, and not
> taken enough time to understand how the new parts work yet.
> 
> In the meantime, I've attached some fixes for v41 which I previously
> submitted for v40.

Thank you.  I've merged it.

Also, I have redesigned how we derive partition indexes after running
pruning steps.  Previously, for each step we'd determine the indexes of
"partitions" that are not pruned leading to a list partition not being
pruned sometimes, as shown in the two recent examples.  Instead, in the
new approach, we only keep track of the indexes of the "datums" that
satisfy individual pruning steps (both base pruning steps and combine
steps) and only figure out the partition indexes after we've determined
set of datums that survive all pruning steps.  That is, after we're done
executing all pruning steps.  Whether we need to scan special partitions
like null-only and default partition is tracked along with datum indexes
for each step.  With this change, pruning works as expected in both examples:

create table lp (a char) partition by list (a);
create table lp_ad partition of lp for values in ('a', 'd');
create table lp_bc partition of lp for values in ('b', 'c');
create table lp_default partition of lp default;
explain (costs off) select * from lp where a > 'a' and a < 'd';
                        QUERY PLAN
-----------------------------------------------------------
 Append
   ->  Seq Scan on lp_bc
         Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
   ->  Seq Scan on lp_default
         Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
(5 rows)

CREATE TABLE listp (a INT) PARTITION BY LIST(a);
CREATE TABLE listp1_3 PARTITION OF listp FOR VALUES IN (1, 3);
EXPLAIN SELECT * FROM listp WHERE a > 1 AND a < 3;
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.00 rows=0 width=4)
   One-Time Filter: false
(2 rows)

Moreover, with pruning now working at a high-level with datum indexes
instead of partition indexes, pruning for PartitionPruneStepOpNe is
simplified greatly.  We simply delete from a bitmapset initially
containing the indexes of all datums in boundinfo the indexes of those
that appear in the query.  So:

explain (costs off) select * from lp where a <> 'a' and a <> 'd';
                         QUERY PLAN
-------------------------------------------------------------
 Append
   ->  Seq Scan on lp_bc
         Filter: ((a <> 'a'::bpchar) AND (a <> 'd'::bpchar))
   ->  Seq Scan on lp_default
         Filter: ((a <> 'a'::bpchar) AND (a <> 'd'::bpchar))
(5 rows)

where we delete indexes of 'a' and 'd' from the bitmapset initially
containing indexes of all datums, leaving us with only those of 'b' and
'c'.  Also, the default partition is scanned as it would always be for a
PartitionPruneStepOpNe step.

Attached is the updated set of patches, which contains other miscellaneous
changes such as updated comments, beside the main changes described above.

Regards,
Amit

Вложения

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

28 марта 2018 г., 16:30:56

On 2018/03/28 18:29, Amit Langote wrote:
> Attached is the updated set of patches, which contains other miscellaneous
> changes such as updated comments, beside the main changes described above.

Sorry, one of those miscellaneous changes was a typo that would cause
compilation to fail... Sigh.   Fixed in the updated version.

Thanks,
Amit

Вложения

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

28 марта 2018 г., 16:46:29

On 28 March 2018 at 22:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Also, I have redesigned how we derive partition indexes after running
> pruning steps.  Previously, for each step we'd determine the indexes of
> "partitions" that are not pruned leading to a list partition not being
> pruned sometimes, as shown in the two recent examples.  Instead, in the
> new approach, we only keep track of the indexes of the "datums" that
> satisfy individual pruning steps (both base pruning steps and combine
> steps) and only figure out the partition indexes after we've determined
> set of datums that survive all pruning steps.  That is, after we're done
> executing all pruning steps.  Whether we need to scan special partitions
> like null-only and default partition is tracked along with datum indexes
> for each step.  With this change, pruning works as expected in both examples:

Smart thinking! Good to see that solved.

I'll try to look at v43 during my working day tomorrow, in around 9 hours time.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Jesper Pedersen

Дата:

28 марта 2018 г., 22:41:15

Hi,

On 03/28/2018 06:30 AM, Amit Langote wrote:
> On 2018/03/28 18:29, Amit Langote wrote:
>> Attached is the updated set of patches, which contains other miscellaneous
>> changes such as updated comments, beside the main changes described above.
> 
> Sorry, one of those miscellaneous changes was a typo that would cause
> compilation to fail... Sigh.   Fixed in the updated version.
>

Just some trivial changes.

However,

explain (costs off) select * from mc2p where a = 2 and b < 1;

is picking up

    ->  Seq Scan on mc2p2
          Filter: ((b < 1) AND (a = 2))

which doesn't seem right, as its definition is

create table mc2p2 partition of mc2p for values from (1, 1) to (2, 
minvalue);

Best regards,
  Jesper

Вложения

delta_v43.patch

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

29 марта 2018 г., 14:35:13

On 2018/03/29 1:41, Jesper Pedersen wrote:
> Just some trivial changes.

Thanks Jesper.  Merged.

> However,
> 
> explain (costs off) select * from mc2p where a = 2 and b < 1;
> 
> is picking up
> 
>    ->  Seq Scan on mc2p2
>          Filter: ((b < 1) AND (a = 2))
> 
> which doesn't seem right, as its definition is
> 
> create table mc2p2 partition of mc2p for values from (1, 1) to (2, minvalue);

Yeah, that wasn't right.  It boiled down to how some code in the range
partition pruning function considered a tuple containing a = 2 to fall in
this partition, which is wrong because the minvalue in its upper bound
makes the partition exclusive of any tuples with a = 2.  Fixed that.

Beside fixing that, I have decided to get rid of the
PartititionPruneStepOpNe (a special kind of base pruning step that was
being used to prune list partitions using a set of <> operator clauses)
and related functions.  Instead pruning for <> operator clauses is now
implemented by using a combination of PartitionPruneStepOp and
PartitionPruneStepCombine after adding a new combine op COMBINE_INVERT (I
also renamed COMBINE_OR and COMBINE_AND to COMBINE_UNION and
COMBINE_INTERSECT, respectively).  I decided to do so because the previous
arrangement looked like a "hack" to support a special case that touched no
less than quite a few places.

Attached find the updated version of patches.

Thanks,
Amit

Вложения

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

29 марта 2018 г., 17:08:31

On 29 March 2018 at 21:35, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Beside fixing that, I have decided to get rid of the
> PartititionPruneStepOpNe (a special kind of base pruning step that was
> being used to prune list partitions using a set of <> operator clauses)
> and related functions.  Instead pruning for <> operator clauses is now
> implemented by using a combination of PartitionPruneStepOp and
> PartitionPruneStepCombine after adding a new combine op COMBINE_INVERT (I
> also renamed COMBINE_OR and COMBINE_AND to COMBINE_UNION and
> COMBINE_INTERSECT, respectively).  I decided to do so because the previous
> arrangement looked like a "hack" to support a special case that touched no
> less than quite a few places.

Hi Amit,

I've looked at the v44 patch. Thanks for making those changes.

The new not-equal handling code is not quite right.

DROP TABLE listp;
CREATE TABLE listp (a INT) PARTITION BY LIST(a);
CREATE TABLE listp1_3 PARTITION OF listp FOR VALUES IN(1,3);
CREATE TABLE listp_default PARTITION OF listp DEFAULT;

EXPLAIN SELECT * FROM listp WHERE a <> 1;
                            QUERY PLAN
------------------------------------------------------------------
 Append  (cost=0.00..54.56 rows=2537 width=4)
   ->  Seq Scan on listp1_3  (cost=0.00..41.88 rows=2537 width=4)
         Filter: (a <> 1)
(3 rows)

The default should be included here.

INSERT INTO listp VALUES(1),(2),(3);
SELECT * FROM listp WHERE a <> 1;
 a
---
 3
(1 row)

This code assumes its fine to just reverse the setting for default:

result->scan_default = !source->scan_default;

More complex handling is needed here.

I've attached a diff for a small set of other things I noticed while reviewing.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

v44_drowley_review.patch

Re: [HACKERS] path toward faster partition pruning

От

Tomas Vondra

Дата:

29 марта 2018 г., 22:55:45

Hi,

I think there's a bug in generate_pruning_steps_from_opexprs, which does
this for PARTITION_STRATEGY_HASH:


    for_each_cell(lc1, lc)
    {
        pc = lfirst(lc1);

        /*
         * Note that we pass nullkeys for step_nullkeys,
         * because we need to tell hash partition bound search
         * function which of the keys are NULL.
         */
        Assert(pc->op_strategy == HTEqualStrategyNumber);
        pc_steps =
                get_steps_using_prefix(context,
                                       HTEqualStrategyNumber,
                                       pc->expr,
                                       pc->cmpfn,
                                       pc->keyno,
                                       nullkeys,
                                       prefix);
    }

    opsteps = list_concat(opsteps, list_copy(pc_steps));


Notice that the list_concat() is outside the for_each_cell loop. Doesn't
that mean we fail to consider some of the clauses (all except the very
last clause) for pruning? I haven't managed to come up with an example,
but I haven't tried very hard.

FWIW I've noticed this because gcc complains that pg_steps might be used
uninitialized:

partprune.c: In function ‘generate_partition_pruning_steps_internal’:
partprune.c:992:16: warning: ‘pc_steps’ may be used uninitialized in
this function [-Wmaybe-uninitialized]
      opsteps = list_concat(opsteps, list_copy(pc_steps));
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
partprune.c:936:14: note: ‘pc_steps’ was declared here
      List   *pc_steps;
              ^~~~~~~~
All of PostgreSQL successfully made. Ready to install.


So even if it's not a bug, we probably need to fix the code somehow.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

30 марта 2018 г., 10:24:19

Thanks Tomas for the review.

On 2018/03/30 1:55, Tomas Vondra wrote:
> Hi,
> 
> I think there's a bug in generate_pruning_steps_from_opexprs, which does
> this for PARTITION_STRATEGY_HASH:
> 
> 
>     for_each_cell(lc1, lc)
>     {
>         pc = lfirst(lc1);
> 
>         /*
>          * Note that we pass nullkeys for step_nullkeys,
>          * because we need to tell hash partition bound search
>          * function which of the keys are NULL.
>          */
>         Assert(pc->op_strategy == HTEqualStrategyNumber);
>         pc_steps =
>                 get_steps_using_prefix(context,
>                                        HTEqualStrategyNumber,
>                                        pc->expr,
>                                        pc->cmpfn,
>                                        pc->keyno,
>                                        nullkeys,
>                                        prefix);
>     }
> 
>     opsteps = list_concat(opsteps, list_copy(pc_steps));
> 
> 
> Notice that the list_concat() is outside the for_each_cell loop. Doesn't
> that mean we fail to consider some of the clauses (all except the very
> last clause) for pruning? I haven't managed to come up with an example,
> but I haven't tried very hard.
> 
> FWIW I've noticed this because gcc complains that pg_steps might be used
> uninitialized:
> 
> partprune.c: In function ‘generate_partition_pruning_steps_internal’:
> partprune.c:992:16: warning: ‘pc_steps’ may be used uninitialized in
> this function [-Wmaybe-uninitialized]
>       opsteps = list_concat(opsteps, list_copy(pc_steps));
>                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> partprune.c:936:14: note: ‘pc_steps’ was declared here
>       List   *pc_steps;
>               ^~~~~~~~
> All of PostgreSQL successfully made. Ready to install.
> 
> 
> So even if it's not a bug, we probably need to fix the code somehow.

Yeah, the code needs to be fixed.  Although, it seems to me that in the
hash partitioning case, the loop would iterate at most once, at least if
the query didn't contain any Params.  That's because, at that point, there
cannot be multiple mutually AND'd equality clauses referring to the same
key.  For example, if there were in the original query and they contained
different values, we wouldn't get this far anyway as they would be reduced
to constant-false at an earlier planning stage.  If they all contained the
same value (e.g. key = 1 and key = 1::smallint and a = 1::int and a =
1::bigint), then only one of them will be left in rel->baserestrictinfo
anyway.  But we still need to have the loop because all of what I said
wouldn't happen if the clauses contained Params.  In that case, the result
would be determined at execution time.

I have fixed the code as you suggested and will post the fixed version
shortly after fixing the issues David reported.

Thanks,
Amit

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

30 марта 2018 г., 11:38:44

Hi David.

On 2018/03/29 20:08, David Rowley wrote:
> I've looked at the v44 patch.

Thank you.

> Thanks for making those changes.
> 
> The new not-equal handling code is not quite right.
> 
> DROP TABLE listp;
> CREATE TABLE listp (a INT) PARTITION BY LIST(a);
> CREATE TABLE listp1_3 PARTITION OF listp FOR VALUES IN(1,3);
> CREATE TABLE listp_default PARTITION OF listp DEFAULT;
> 
> EXPLAIN SELECT * FROM listp WHERE a <> 1;
>                             QUERY PLAN
> ------------------------------------------------------------------
>  Append  (cost=0.00..54.56 rows=2537 width=4)
>    ->  Seq Scan on listp1_3  (cost=0.00..41.88 rows=2537 width=4)
>          Filter: (a <> 1)
> (3 rows)
> 
> The default should be included here.
> 
> INSERT INTO listp VALUES(1),(2),(3);
> SELECT * FROM listp WHERE a <> 1;
>  a
> ---
>  3
> (1 row)

Good catch!  Indeed, the default partition should not have been pruned
away in this case.

> This code assumes its fine to just reverse the setting for default:
> 
> result->scan_default = !source->scan_default;
> 
> More complex handling is needed here.

Hmm, I thought about this and came to a conclusion that we should *always*
scan the default partition in this case.  Inversion step removes all the
datums selected by the source step from the set of *all* datums that the
currently defined set of partitions allow.  If there's a default partition
in the mix, that means the latter contains all the datums of the partition
key's data type.  Irrespective of whether or not the source step selected
the default partition, there would be datums that would be in the set
after inversion which in turn would be in the default partition, if not in
some non-default partition that would've been selected.  I have written a
comment there trying to explain this, but I may not have been able to
articulate it properly.  Please check.  Or does this sound just wrong?

> I've attached a diff for a small set of other things I noticed while
reviewing.

Thanks, merged.

Please find attached the updated patches.

Thanks,
Amit

Вложения

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

30 марта 2018 г., 18:18:06

On 30 March 2018 at 18:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Please find attached the updated patches.

Thanks.

I've noticed that there are no outfuncs or readfuncs for all the new
Step types you've added.

Also, the copy func does not properly copy the step_id in the base
node type. This will remain at 0 after a copyObject()


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

30 марта 2018 г., 19:00:14

On 31 March 2018 at 01:18, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've noticed that there are no outfuncs or readfuncs for all the new
> Step types you've added.
>
> Also, the copy func does not properly copy the step_id in the base
> node type. This will remain at 0 after a copyObject()

I've attached a quickly put together fix for this. I'm not quite sure
what the done thing is to copy/read/write nodes which inherit fields
from other nodes, so what I've done in the attached might not be
correct. However, I see _copyValue() does what I've done, so perhaps
its fine, or that may just be more of a special case.

I also manually removed some hunks from the diff, so hopefully, it
still works correctly.

Attaching it as it may save you some time from doing it yourself.
Please check it though.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

copy_read_write_func_fixes.patch

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

30 марта 2018 г., 19:33:30

David Rowley wrote:

> Also, the copy func does not properly copy the step_id in the base
> node type. This will remain at 0 after a copyObject()

As I recall, you can find these mistakes by compiling with
-DCOPY_PARSE_PLAN_TREES.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

30 марта 2018 г., 19:41:02

On 31 March 2018 at 02:00, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 31 March 2018 at 01:18, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I've noticed that there are no outfuncs or readfuncs for all the new
>> Step types you've added.
>>
>> Also, the copy func does not properly copy the step_id in the base
>> node type. This will remain at 0 after a copyObject()
>
> Attaching it as it may save you some time from doing it yourself.
> Please check it though.

The attached might be slightly easier to apply. The previous version
was based on top of some other changes I'd been making.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

copy_read_write_func_fixes_v2.patch

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

30 марта 2018 г., 21:55:44

On 30 March 2018 at 18:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Please find attached the updated patches.

There's a bit of a strange case with v45 around prepared statements.
I've not debugged this yet, but in case you get there first, here's
the case:

create table listp (a int, b int) partition by list (a);
create table listp_1 partition of listp for values in(1) partition by list (b);
create table listp_1_1 partition of listp_1 for values in(1);
create table listp_2 partition of listp for values in(2) partition by list (b);
create table listp_2_1 partition of listp_2 for values in(2);

explain select * from listp where b in(1,2) and 2<>b and 0<>b; -- this
one looks fine.
                                 QUERY PLAN
----------------------------------------------------------------------------
 Append  (cost=0.00..49.66 rows=22 width=8)
   ->  Seq Scan on listp_1_1  (cost=0.00..49.55 rows=22 width=8)
         Filter: ((b = ANY ('{1,2}'::integer[])) AND (2 <> b) AND (0 <> b))
(3 rows)

prepare q1 (int,int,int,int) as select * from listp where b in($1,$2)
and $3 <> b and $4 <> b;
execute q1 (1,2,3,4);
execute q1 (1,2,3,4);
execute q1 (1,2,3,4);
execute q1 (1,2,3,4);
execute q1 (1,2,3,4);
explain (analyze, costs off, summary off, timing off)  execute q1 (1,2,2,0);
           QUERY PLAN
--------------------------------
 Result (actual rows=0 loops=1)
   One-Time Filter: false
(2 rows)

My best guess is that something ate the bits out of a Bitmapset of the
matching partitions somewhere.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

02 апреля 2018 г., 07:42:47

On 2018/03/30 22:41, David Rowley wrote:
> On 31 March 2018 at 02:00, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> On 31 March 2018 at 01:18, David Rowley <david.rowley@2ndquadrant.com> wrote:
>>> I've noticed that there are no outfuncs or readfuncs for all the new
>>> Step types you've added.
>>>
>>> Also, the copy func does not properly copy the step_id in the base
>>> node type. This will remain at 0 after a copyObject()
>>
>> Attaching it as it may save you some time from doing it yourself.
>> Please check it though.
> 
> The attached might be slightly easier to apply. The previous version
> was based on top of some other changes I'd been making.

Thanks David.  I have merged this.

Regards,
Amit

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

02 апреля 2018 г., 11:18:04

Hi David.

On 2018/03/31 0:55, David Rowley wrote:
> On 30 March 2018 at 18:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Please find attached the updated patches.
> 
> There's a bit of a strange case with v45 around prepared statements.
> I've not debugged this yet, but in case you get there first, here's
> the case:
> 
> create table listp (a int, b int) partition by list (a);
> create table listp_1 partition of listp for values in(1) partition by list (b);
> create table listp_1_1 partition of listp_1 for values in(1);
> create table listp_2 partition of listp for values in(2) partition by list (b);
> create table listp_2_1 partition of listp_2 for values in(2);
> 
> explain select * from listp where b in(1,2) and 2<>b and 0<>b; -- this
> one looks fine.
>                                  QUERY PLAN
> ----------------------------------------------------------------------------
>  Append  (cost=0.00..49.66 rows=22 width=8)
>    ->  Seq Scan on listp_1_1  (cost=0.00..49.55 rows=22 width=8)
>          Filter: ((b = ANY ('{1,2}'::integer[])) AND (2 <> b) AND (0 <> b))
> (3 rows)
> 
> prepare q1 (int,int,int,int) as select * from listp where b in($1,$2)
> and $3 <> b and $4 <> b;
> execute q1 (1,2,3,4);
> execute q1 (1,2,3,4);
> execute q1 (1,2,3,4);
> execute q1 (1,2,3,4);
> execute q1 (1,2,3,4);
> explain (analyze, costs off, summary off, timing off)  execute q1 (1,2,2,0);
>            QUERY PLAN
> --------------------------------
>  Result (actual rows=0 loops=1)
>    One-Time Filter: false
> (2 rows)
> 
> My best guess is that something ate the bits out of a Bitmapset of the
> matching partitions somewhere.

Hmm.  It is the newly added inversion step that's causing this.  When
creating a generic plan (that is when the planning happens via
BuildCachedPlan called with boundParams set to NULL), the presence of
Params will cause an inversion step's source step to produce
scan-all-partitions sort of result, which the inversion step dutifully
inverts to a scan-no-partitions result.

I have tried to attack that problem by handling the
no-values-to-prune-with case using a side-channel to propagate the
scan-all-partitions result through possibly multiple steps.  That is, a
base pruning step will set datum_offsets in a PruneStepResult only if
pruning is carried out by actually comparing values with the partition
bounds.  If no values were provided (like in the generic plan case), it
will set a scan_all_nonnull flag instead and return without setting
datum_offsets.  Combine steps perform their combining duty only if
datum_offset contains a valid value, that is, if scan_all_nonnulls is not set.

Attached updated version of the patches.

Thanks,
Amit

Вложения

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

02 апреля 2018 г., 18:03:15

On 2 April 2018 at 17:18, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/03/31 0:55, David Rowley wrote:
>> explain (analyze, costs off, summary off, timing off)  execute q1 (1,2,2,0);
>>            QUERY PLAN
>> --------------------------------
>>  Result (actual rows=0 loops=1)
>>    One-Time Filter: false
>> (2 rows)
>
> Hmm.  It is the newly added inversion step that's causing this.  When
> creating a generic plan (that is when the planning happens via
> BuildCachedPlan called with boundParams set to NULL), the presence of
> Params will cause an inversion step's source step to produce
> scan-all-partitions sort of result, which the inversion step dutifully
> inverts to a scan-no-partitions result.
>
> I have tried to attack that problem by handling the
> no-values-to-prune-with case using a side-channel to propagate the
> scan-all-partitions result through possibly multiple steps.  That is, a
> base pruning step will set datum_offsets in a PruneStepResult only if
> pruning is carried out by actually comparing values with the partition
> bounds.  If no values were provided (like in the generic plan case), it
> will set a scan_all_nonnull flag instead and return without setting
> datum_offsets.  Combine steps perform their combining duty only if
> datum_offset contains a valid value, that is, if scan_all_nonnulls is not set.

I'm afraid this is still not correct :-(

The following code is not doing the right thing:

+ case COMBINE_UNION:
+ foreach(lc1, cstep->source_stepids)
+ {
+ int step_id = lfirst_int(lc1);
+ PruneStepResult *step_result;
+
+ /*
+ * step_results[step_id] must contain a valid result,
+ * which is confirmed by the fact that cstep's step_id is
+ * greater than step_id and the fact that results of the
+ * individual steps are evaluated in sequence of their
+ * step_ids.
+ */
+ if (step_id >= cstep->step.step_id)
+ elog(ERROR, "invalid pruning combine step argument");
+ step_result = step_results[step_id];
+ Assert(step_result != NULL);
+
+ result->scan_all_nonnull = step_result->scan_all_nonnull;

The last line there is not properly performing a union, it just sets
the result_scan_all_nonnull to whatever the last step's value was.

At the very least it should be |= but I don't really like this new code.

Why did you move away from just storing the matching partitions in a
Bitmapset? If you want to store all non-null partitions, then why not
just set the bits for all non-null partitions? That would cut down on
bugs like this since the combining of step results would just be
simple unions or intersects.

Also, the following code could be made a bit nicer

+ result = (PruneStepResult *) palloc0(sizeof(PruneStepResult));
+
+ switch (context->strategy)
+ {
+ case PARTITION_STRATEGY_HASH:
+ result->bound_offsets = get_matching_hash_bound(context,
+ opstep->opstrategy,
+ values, nvalues,
+ partsupfunc,
+ opstep->nullkeys,
+ &result->scan_all_nonnull);

Why not allocate the PruneStepResult inside the get_matching_*_bound,
that way you wouldn't need all those out parameters to set the bool
fields.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

03 апреля 2018 г., 18:02:50

Hi David.

On 2018/04/02 21:03, David Rowley wrote:
> On 2 April 2018 at 17:18, Amit Langote wrote:
>> On 2018/03/31 0:55, David Rowley wrote:
>>> explain (analyze, costs off, summary off, timing off)  execute q1 (1,2,2,0);
>>>            QUERY PLAN
>>> --------------------------------
>>>  Result (actual rows=0 loops=1)
>>>    One-Time Filter: false
>>> (2 rows)
>>
>> Hmm.  It is the newly added inversion step that's causing this.  When
>> creating a generic plan (that is when the planning happens via
>> BuildCachedPlan called with boundParams set to NULL), the presence of
>> Params will cause an inversion step's source step to produce
>> scan-all-partitions sort of result, which the inversion step dutifully
>> inverts to a scan-no-partitions result.
>>
>> I have tried to attack that problem by handling the
>> no-values-to-prune-with case using a side-channel to propagate the
>> scan-all-partitions result through possibly multiple steps.  That is, a
>> base pruning step will set datum_offsets in a PruneStepResult only if
>> pruning is carried out by actually comparing values with the partition
>> bounds.  If no values were provided (like in the generic plan case), it
>> will set a scan_all_nonnull flag instead and return without setting
>> datum_offsets.  Combine steps perform their combining duty only if
>> datum_offset contains a valid value, that is, if scan_all_nonnulls is not set.
> 
> I'm afraid this is still not correct :-(
> 
> The following code is not doing the right thing:
> 
> + case COMBINE_UNION:
> + foreach(lc1, cstep->source_stepids)
> + {
> + int step_id = lfirst_int(lc1);
> + PruneStepResult *step_result;
> +
> + /*
> + * step_results[step_id] must contain a valid result,
> + * which is confirmed by the fact that cstep's step_id is
> + * greater than step_id and the fact that results of the
> + * individual steps are evaluated in sequence of their
> + * step_ids.
> + */
> + if (step_id >= cstep->step.step_id)
> + elog(ERROR, "invalid pruning combine step argument");
> + step_result = step_results[step_id];
> + Assert(step_result != NULL);
> +
> + result->scan_all_nonnull = step_result->scan_all_nonnull;
> 
> The last line there is not properly performing a union, it just sets
> the result_scan_all_nonnull to whatever the last step's value was.>
> At the very least it should be |= but I don't really like this new code.
> 
> Why did you move away from just storing the matching partitions in a
> Bitmapset? If you want to store all non-null partitions, then why not
> just set the bits for all non-null partitions? That would cut down on
> bugs like this since the combining of step results would just be
> simple unions or intersects.

As I mentioned in my previous email, I had to find a side-channel (that is
scan_all_nonnull) to store this information instead of doing it the
regular way, to differentiate the case where we need to scan all
partitions because of values in the base prune steps not being available
from the case where carrying out a step using actual values ends up
selecting all partitions.  When creating a generic plan, values of none of
the Params that are added to base prune steps are available and that
results in reaching the actual pruning functions
(get_matching_hash/list/range_bounds) without any values, which results in
each of those functions, in returning all partitions containing non-null data.

But actually, the presence of only Params in the pruning steps should
result in the pruning not being invoked at all (at least for the static
pruning case), thus selecting all partitions containing non-null data.  It
is better to implement that instead of a workaround like scan_all_nonnulls
side-channel I was talking about.

Fixed the patch to implement it that way.

> Also, the following code could be made a bit nicer
> 
> + result = (PruneStepResult *) palloc0(sizeof(PruneStepResult));
> +
> + switch (context->strategy)
> + {
> + case PARTITION_STRATEGY_HASH:
> + result->bound_offsets = get_matching_hash_bound(context,
> + opstep->opstrategy,
> + values, nvalues,
> + partsupfunc,
> + opstep->nullkeys,
> + &result->scan_all_nonnull);
> 
> Why not allocate the PruneStepResult inside the get_matching_*_bound,
> that way you wouldn't need all those out parameters to set the bool
> fields.

I thought it'd be nice to have perform_pruning_base_step generate the
actual PruneStepResult instead of the functions for individual
partitioning strategies, which in a way minimizes places where it is
manipulated.  Since, we've divided bound searching into 3 separate
functions anyway, it also seemed better to me to have their signatures be
relevant to the partition strategy they cater to.  The function for hash
partitioning, for example, never has to deal with setting the result for
the null or the default partition and the range partitioning function
doesn't have to worry about doing anything about for null partition.
Also, overall footprint of those 3 functions reduced because they don't
have to create the PruneStepResult themselves.

Attached v47.

Thanks,
Amit

On 4 April 2018 at 11:22, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 4 April 2018 at 09:47, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I think it would be better to just have special handling in
>> get_matching_list_bound so that it knows it's performing <>
>> elimination. I'd thought about passing some other opstrategy but the
>> only safe one I thought to use was InvalidStrategy, which is already
>> used by NULL handling.
>
> I'm currently working up a patch to do this the way I think is best.
>
> I'll submit it soon and we can review and get your thoughts on it.

I've attached a rough cut version of what I think is a good solution
for this. It's based on v46, not your latest v47, sorry.

This makes get_matching_list_bounds() aware that it's performing the
not-equal pruning via the opstrategy which allows it to not return all
partitions when there are no values in this case. Instead, we return
the NULL partition, so that we later invert that and return everything
apart from the NULL partition. A strict clause will allow us that
much, even if we can't get the actual value being compared to, at the
time.

There's also a bunch of other changes in there:

1. Adding missing step_id in copyfuncs.c
2. Simplified including the default partition in a bunch of cases.
3. Made it so scan_default and scan_null are only ever set to true if
there's a partition for that.
4. Changed get_matching_*_bounds to return the entire result struct
instead of the Bitmapset and pass the remaining bool values back
through params. I didn't really like how you'd change this to pass all
the bool flags back as params. There's a perfectly good struct there
to provide the entire result in a single return value. I know you've
disagreed with this already, so would be nice to get a 3rd opinion.
5. Rename get_matching_hash_bound to get_matching_hash_bounds. The
LIST and RANGE version of this function both had a plural name. I
didn't see any reason for the hash case to be different.

Let me know what you think.

I've patched the run-time pruning v18 against this and it now passes regression.

I need to do a bit more testing on this to ensure it works for all
cases, but thought I'd send now as I suspect you're currently around
to look.

There might be another issue with the patch too, but I'll send a
separate email about that.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

v46_fixes_drowley.patch

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

04 апреля 2018 г., 09:21:06

On 4 April 2018 at 13:13, David Rowley <david.rowley@2ndquadrant.com> wrote:
> There might be another issue with the patch too, but I'll send a
> separate email about that.

In the current version of the patch the following comment exists:

/*
* Fall-through for a NOT clause, which is handled in
* match_clause_to_partition_key().
*/

The only real handling of NOT clauses is in
match_boolean_partition_clause() which just handles NOT(true) or
NOT(false).

It's true that the const simplification code will generally rewrite
most NOT(clause) to use the negator operator, but if the operator does
not have a negator it can't do this.

We probably don't have any built-in operators which are members of a
btree opclass which have no negator, but it's simple enough to modify
the citext extension by commenting out the NEGATOR lines in
citext--1.4--1.5.sql.

create extension citext;
create table listp(a citext) partition by list(a citext_pattern_ops);
create table listp_1 partition of listp for values in('1');
explain select * from listp where not (a ~>~ '0' and  a ~<~ '2');
                                QUERY PLAN
--------------------------------------------------------------------------
 Append  (cost=0.00..36.45 rows=1209 width=32)
   ->  Seq Scan on listp_1  (cost=0.00..30.40 rows=1209 width=32)
         Filter: ((NOT (a ~>~ '0'::citext)) OR (NOT (a ~<~ '2'::citext)))
(3 rows)

At the moment pruning does not work for this case at all. Perhaps it should?

I imagine it might be possible to re-work the COMBINE_INVERT code so
that it becomes a flag of the combine step rather than a step operator
type. It should then be possible to invert COMBINE_UNION for
NOT(clause1 OR clause2) and COMBINE_INTERSECT on NOT(clause1 AND
clause2).

IOW, it might not take too many lines of code to put this right.
Probably the bulk of the work would be writing a test with a btree
opclass that will allow us to have the planner not invert the clause
during const folding.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Tom Lane

Дата:

04 апреля 2018 г., 10:00:13

David Rowley <david.rowley@2ndquadrant.com> writes:
> It's true that the const simplification code will generally rewrite
> most NOT(clause) to use the negator operator, but if the operator does
> not have a negator it can't do this.
> ...
> At the moment pruning does not work for this case at all. Perhaps it should?

It's hard to see why we'd expend extra effort to optimize such situations.
The right answer would invariably be to fix the inadequate operator
definition, because missing the negator link would hobble many other
cases besides this.

Now if you can show a case where the extra smarts would be useful
without presuming a badly-written opclass, it's a different matter.

            regards, tom lane

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

04 апреля 2018 г., 10:03:31

On 4 April 2018 at 16:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> It's true that the const simplification code will generally rewrite
>> most NOT(clause) to use the negator operator, but if the operator does
>> not have a negator it can't do this.
>> ...
>> At the moment pruning does not work for this case at all. Perhaps it should?
>
> It's hard to see why we'd expend extra effort to optimize such situations.
> The right answer would invariably be to fix the inadequate operator
> definition, because missing the negator link would hobble many other
> cases besides this.
>
> Now if you can show a case where the extra smarts would be useful
> without presuming a badly-written opclass, it's a different matter.

Okay, well that certainly sounds like less work.

In that case, the comment which claims we handle the NOT clauses needs
to be updated to mention that we only handle boolean NOT clauses and
don't optimize the remainder.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

04 апреля 2018 г., 11:42:58

Hi David.

On 2018/04/04 10:13, David Rowley wrote:
> On 4 April 2018 at 11:22, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> On 4 April 2018 at 09:47, David Rowley <david.rowley@2ndquadrant.com> wrote:
>>> I think it would be better to just have special handling in
>>> get_matching_list_bound so that it knows it's performing <>
>>> elimination. I'd thought about passing some other opstrategy but the
>>> only safe one I thought to use was InvalidStrategy, which is already
>>> used by NULL handling.

Thanks for this suggestion.

Having the special case handling for steps corresponding to <> operator
clauses in get_matching_list_bounds() seems like the best way and that
should have been the way all along.  It occurred to me that after I
changed the patch to store datum offsets in the result, there wasn't any
need for special handling of <> operators at a higher level -- like the
special pruning function (get_partitions_excluded_by_ne_datums) that we
used to have or the COMBINE_INVERT I recently proposed.

For each datum coming from a <> operator clause (signaled to
get_matching_list_bounds by passing InvalidStrategy for opstrategy), we
return all datums minus the one that was passed (if the latter is indeed
found in boundinfo).  Bounds for individual <> operator clauses will be
combined using INTERSECT at a higher level to give the desired result.  No
need for the invert step and for the planner to set things up very
carefully for invert step to do the right thing.

create table lp (a int) partition by list (a);
create table lp1 partition of lp for values in (1);
create table lp2 partition of lp for values in (2);
create table lp3 partition of lp for values in (3);
create table lp_null partition of lp for values in (null);
create table lp_default partition of lp default;

For

explain select * from lp where a <> 1

get_matching_list_bounds will returns the set of offsets of {2, 3} and
will set scan_default, while setting scan_null false;

and for

explain select * from lp where a <> 1 and a <> 3

it will returns the set of offsets of {2, 3} and {1, 2} for the individual
base steps and along setting scan_default and setting scan_null to false;
the INTERSECT combination step still combine those to give the offset of 2
with scan_default set to true and scan_null set to false.

                             QUERY PLAN
--------------------------------------------------------------------
 Append  (cost=0.00..121.75 rows=5050 width=4)
   ->  Seq Scan on lp2  (cost=0.00..48.25 rows=2525 width=4)
         Filter: ((a <> 1) AND (a <> 3))
   ->  Seq Scan on lp_default  (cost=0.00..48.25 rows=2525 width=4)
         Filter: ((a <> 1) AND (a <> 3))

If there are no values then the offsets of all the bounds will be returned
and unlike the previous setup with the INVERT step in the mix, they will
all survive.

I know we've been back and forth quite a bit on this, but this solution
seems like the one with the least amount of hackery.  Hope you find it to
be the same way.

>> I'm currently working up a patch to do this the way I think is best.
>>
>> I'll submit it soon and we can review and get your thoughts on it.
> 
> I've attached a rough cut version of what I think is a good solution
> for this. It's based on v46, not your latest v47, sorry.
> 
> This makes get_matching_list_bounds() aware that it's performing the
> not-equal pruning via the opstrategy which allows it to not return all
> partitions when there are no values in this case. Instead, we return
> the NULL partition, so that we later invert that and return everything
> apart from the NULL partition. A strict clause will allow us that
> much, even if we can't get the actual value being compared to, at the
> time.

As I explained above, I considered your general idea of teaching
get_matching_list_bounds to deal with being passed InvalidStrategy for
opstrategy to signal special handling of a <> clause datum.

By implementing that, I was able to get rid of a bunch of code in
partprune.c and remove the COMBINE_INVERT related code.  We can add
COMBINE_INVERT later if and when we need it (for some legitimate purpose).

> There's also a bunch of other changes in there:

Thanks.

> 1. Adding missing step_id in copyfuncs.c

Merged.

> 2. Simplified including the default partition in a bunch of cases.
> 3. Made it so scan_default and scan_null are only ever set to true if
> there's a partition for that.

I have merged these too.

> 4. Changed get_matching_*_bounds to return the entire result struct
> instead of the Bitmapset and pass the remaining bool values back
> through params. I didn't really like how you'd change this to pass all
> the bool flags back as params. There's a perfectly good struct there
> to provide the entire result in a single return value. I know you've
> disagreed with this already, so would be nice to get a 3rd opinion.

I went ahead with them returning PruneStepResult struct.

> 5. Rename get_matching_hash_bound to get_matching_hash_bounds. The
> LIST and RANGE version of this function both had a plural name. I
> didn't see any reason for the hash case to be different.

Agreed, merged.

> Let me know what you think.

I'm not sure about the following change in your patch:

-        if (!result->scan_null)
-            result->scan_null = step_result->scan_null;
-        if (!result->scan_default)
-            result->scan_default = step_result->scan_default;
+        result->scan_null |= step_result->scan_null;
+        result->scan_default |= step_result->scan_default;

Afaik, |= does bitwise OR, which even if it might give the result we want,
is not a logical operation.  I had written the original code using the
following definition of logical OR.

  a OR b = if a then true else b

Also, since things work normally even if we pass no values to
get_matching_list_bounds, including via the dummy step generated for IS
NOT NULL clause(s), I don't see the need to store notnullkeys in the prune
step.  Especially, it's redundant to set notnullkeys in the pruning step
containing non-empty exprs since, by definition, they will select
partitions containing non-null datums.

> I've patched the run-time pruning v18 against this and it now passes regression.
> 
> I need to do a bit more testing on this to ensure it works for all
> cases, but thought I'd send now as I suspect you're currently around
> to look.

See if attached works for you.

> There might be another issue with the patch too, but I'll send a
> separate email about that.

I suppose this is the email about support for pruning using NOT clauses in
partition.c.  It might be possible to do that by tweaking things somehow
by re-introducing the COMBINE_INVERT step (legitimately needed in that
case) and modifying partprune.c to capture NOT clauses in more cases than
it does currently.

Although, I modified the comment like you suggested that we only support
Boolean NOT clause in special cases like when using Boolean partitioning
opfamily.

Attached v48.

Thanks,
Amit

Вложения

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

04 апреля 2018 г., 13:04:52

On 2018/04/04 14:42, Amit Langote wrote:
> Attached v48.

I had forgotten to remove the static_pruning parameter I had added in the
v47, because it is no longer used.  Static pruning now occurs even if a
step contains all Params, in which case each of
get_matching_hash/list/range_bounds() functions returns offsets of all
non-null datums, because the Params cannot be resolved to actual values
during static pruning.

Also, a few changes to get_matching_partitions that David had proposed in
his delta patch but I had failed to include them in v48.

Attached v49.

Thanks,
Amit

On 4 April 2018 at 19:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/04/04 14:42, Amit Langote wrote:
>> Attached v48.
>
> I had forgotten to remove the static_pruning parameter I had added in the
> v47, because it is no longer used.  Static pruning now occurs even if a
> step contains all Params, in which case each of
> get_matching_hash/list/range_bounds() functions returns offsets of all
> non-null datums, because the Params cannot be resolved to actual values
> during static pruning.

Thanks for updating. I've made a pass over v49 and I didn't find very
much wrong with it.

The only real bug I found was a missing IsA(rinfo->clause, Const) in
the pseudoconstant check inside
generate_partition_pruning_steps_internal.

Most of the changes are comment fixes with a few stylistic changes
thrown which are pretty much all there just to try to shrink the code
a line or two or reduce indentation.

I feel pretty familiar with this code now and assuming the attached is
included I'm happy for someone else, hopefully, a committer to take a
look at it.

I'll leave the following notes:

1. Still not sure about RelOptInfo->has_default_part. This flag is
only looked at in generate_partition_pruning_steps. The RelOptInfo and
the boundinfo is available to look at, it's just that the
partition_bound_has_default macro is defined in partition.c rather
than partition.h.

2. Don't really like the new isopne variable name. It's not very
simple to decode, perhaps something like is_not_eq is better?

3. The part of the code I'm least familiar with is
get_steps_using_prefix_recurse(). I admit to not having had time to
fully understand that and consider ways to break it.

Marking as ready for committer.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

v49_fixes_drowley.patch

Re: [HACKERS] path toward faster partition pruning

От

Jesper Pedersen

Дата:

04 апреля 2018 г., 21:45:56

Hi,

On 04/04/2018 09:29 AM, David Rowley wrote:
> Thanks for updating. I've made a pass over v49 and I didn't find very
> much wrong with it.
> 
> The only real bug I found was a missing IsA(rinfo->clause, Const) in
> the pseudoconstant check inside
> generate_partition_pruning_steps_internal.
> 
> Most of the changes are comment fixes with a few stylistic changes
> thrown which are pretty much all there just to try to shrink the code
> a line or two or reduce indentation.
> 
> I feel pretty familiar with this code now and assuming the attached is
> included I'm happy for someone else, hopefully, a committer to take a
> look at it.
> 
> I'll leave the following notes:
> 
> 1. Still not sure about RelOptInfo->has_default_part. This flag is
> only looked at in generate_partition_pruning_steps. The RelOptInfo and
> the boundinfo is available to look at, it's just that the
> partition_bound_has_default macro is defined in partition.c rather
> than partition.h.
> 
> 2. Don't really like the new isopne variable name. It's not very
> simple to decode, perhaps something like is_not_eq is better?
> 
> 3. The part of the code I'm least familiar with is
> get_steps_using_prefix_recurse(). I admit to not having had time to
> fully understand that and consider ways to break it.
> 
> Marking as ready for committer.
> 

Passes check-world, and CommitFest app has been updated to reflect the 
current patch set. Trivial changes attached.

Best regards,
  Jesper

Вложения

v49_fixes_jpedersen.patch

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

05 апреля 2018 г., 14:28:56

Hi.

On 2018/04/05 0:45, Jesper Pedersen wrote:
> Hi,
> 
> On 04/04/2018 09:29 AM, David Rowley wrote:
>> Thanks for updating. I've made a pass over v49 and I didn't find very
>> much wrong with it.
>>
>> The only real bug I found was a missing IsA(rinfo->clause, Const) in
>> the pseudoconstant check inside
>> generate_partition_pruning_steps_internal.

Fixed.

>> Most of the changes are comment fixes with a few stylistic changes
>> thrown which are pretty much all there just to try to shrink the code
>> a line or two or reduce indentation.
>>
>> I feel pretty familiar with this code now and assuming the attached is
>> included I'm happy for someone else, hopefully, a committer to take a
>> look at it.

Thank you, your changes look good to me.

>> I'll leave the following notes:
>>
>> 1. Still not sure about RelOptInfo->has_default_part. This flag is
>> only looked at in generate_partition_pruning_steps. The RelOptInfo and
>> the boundinfo is available to look at, it's just that the
>> partition_bound_has_default macro is defined in partition.c rather
>> than partition.h.

Hmm, it might not be such a bad idea to bring out the
PartitionBoundInfoData into partition.h.  If we do that, we won't need the
has_default_part that the patch adds to RelOptInfo.

In the Attached v50 set, 0002 does that.

>> 2. Don't really like the new isopne variable name. It's not very
>> simple to decode, perhaps something like is_not_eq is better?

isopne does sound a bit unintelligible.  I propose op_is_ne so that it
sounds consistent with the preceding member of the struct that's called
opno.  I want to keep "ne" and not start calling it not_eq, as a few other
places use the string "ne" to refer to a similar thing, like:

/* inequality */
Datum
range_ne(PG_FUNCTION_ARGS)

Datum
timestamptz_ne_date(PG_FUNCTION_ARGS)

Since the field is local to partprune.c, I guess that it's fine as the
comment where it's defined tells what it is.

>> 3. The part of the code I'm least familiar with is
>> get_steps_using_prefix_recurse(). I admit to not having had time to
>> fully understand that and consider ways to break it.

The purpose of that code is to generate *all* needed steps to be combined
using COMBINE_INTERSECT such that the pruning will occur using the most
restrictive set of clauses in cases where the same key is referenced in
multiple restriction clauses containing non-equality operators.  So, for a
range partitioned table on (a, b):

For a query like

explain select * from foo a <= 1 and a <= 3 and b < 5 and b <= 10

Pruning steps generated to be combined with an enclosing INTERSECT step
will be as follows:

<= (1, 10)
<  (1, 5)
<= (3, 10)
<  (3, 5)

>> Marking as ready for committer.

Thank you!

> Passes check-world, and CommitFest app has been updated to reflect the
> current patch set. Trivial changes attached.

Merged these changes.  Thanks again Jesper.

Attached v50.

Thanks,
Amit

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

05 апреля 2018 г., 14:31:41

On 2018/04/05 17:28, Amit Langote wrote:
> Attached v50.

Really attached this time.

Thanks,
Amit

Вложения

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

05 апреля 2018 г., 20:26:35

Amit Langote wrote:

> >> 1. Still not sure about RelOptInfo->has_default_part. This flag is
> >> only looked at in generate_partition_pruning_steps. The RelOptInfo and
> >> the boundinfo is available to look at, it's just that the
> >> partition_bound_has_default macro is defined in partition.c rather
> >> than partition.h.
> 
> Hmm, it might not be such a bad idea to bring out the
> PartitionBoundInfoData into partition.h.  If we do that, we won't need the
> has_default_part that the patch adds to RelOptInfo.
> 
> In the Attached v50 set, 0002 does that.

After looking at this for a moment, I again come to the conclusion that
the overall layout of partitioning code and definitions is terrible.
But we already know that, and there's a patch in commitfest to improve
things.  So my intention right now is to hold my nose and get this
pushed; we'll fix it afterwards.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

06 апреля 2018 г., 04:35:12

I seems pretty clear that putting get_matching_partitions() in
catalog/partition.c is totally the wrong thing; it belongs wholly in
partprune. I think the reason you put it there is that it requires
access to a lot of internals that are static in partition.c.  In the
attached not yet cleaned version of the patch, I have moved a whole lot
of what you added to partition.c to partprune.c; and for the functions
and struct declarations that were required to make it work, I created
catalog/partition_internal.h.

I changed a lot of code also, but cosmetic changes only.

I'll clean this up a bit more now, and try to commit shortly (or early
tomorrow); wanted to share current status now in case I have to rush
out.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

fastprune.patch

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

06 апреля 2018 г., 05:33:55

> @@ -1717,8 +1691,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
>       * parentrte already has the root partrel's updatedCols translated to match
>       * the attribute ordering of parentrel.
>       */
> -    if (!*part_cols_updated)
> -        *part_cols_updated =
> +    if (!root->partColsUpdated)
> +        root->partColsUpdated =
>              has_partition_attrs(parentrel, parentrte->updatedCols, NULL);

Hmm, surely this should be |= to avoid resetting a value set in a
previous call to this function?  In the previous coding it wasn't
necessary because it was a local variable ...  (though, isn't it a bit
odd to have this in PlannerInfo?  seems like it should be in
resultRelInfo, but then you already have it there so I suppose this one
does *more*)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

06 апреля 2018 г., 06:02:11

On 6 April 2018 at 10:35, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I changed a lot of code also, but cosmetic changes only.
>
> I'll clean this up a bit more now, and try to commit shortly (or early
> tomorrow); wanted to share current status now in case I have to rush
> out.

I made a complete pass over the patch you sent. I only noted down the
following few things:

1.

+ * off < 0, meaning the look-up value is smaller that all bounds,

that -> than

2. I guess this will be removed before commit.

+#if 0
<large section of comments>
+#endif

3. This comment seems like a strange thing to write just before
testing if the clause matches the partition key.

+ /* Clause does not match this partition key. */
+ if (equal(leftop, partkey))
+ *rightop = not_clause((Node *) clause)
+ ? (Expr *) makeBoolConst(false, false)
+ : (Expr *) makeBoolConst(true, false);

4. Comment needs removed.

+ * has_default_part - Whether the table has a default partition

The only other thing I noted on this pass is that we could get rid of:

+ /* go check the next clause. */
+ if (unsupported_clause)
+ break;

and just "continue" instead of "break" in all cases apart from case
PARTCLAUSE_UNSUPPORTED:

it would save a few lines and a single condition. What's there works,
but thought this might be better...

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

06 апреля 2018 г., 06:06:08

On 6 April 2018 at 12:02, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 6 April 2018 at 10:35, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> The only other thing I noted on this pass is that we could get rid of:
>
> + /* go check the next clause. */
> + if (unsupported_clause)
> + break;
>
> and just "continue" instead of "break" in all cases apart from case
> PARTCLAUSE_UNSUPPORTED:

I should have said remove:

+ if (unsupported_clause)

The "break" would still be required.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

06 апреля 2018 г., 07:33:17

Hi.

On 2018/04/06 7:35, Alvaro Herrera wrote:
> I seems pretty clear that putting get_matching_partitions() in
> catalog/partition.c is totally the wrong thing; it belongs wholly in
> partprune. I think the reason you put it there is that it requires
> access to a lot of internals that are static in partition.c.  In the
> attached not yet cleaned version of the patch, I have moved a whole lot
> of what you added to partition.c to partprune.c; and for the functions
> and struct declarations that were required to make it work, I created
> catalog/partition_internal.h.

Yes, I really wanted for most of the new code that this patch adds to land
in the planner, especially after Robert's comments here:

https://www.postgresql.org/message-id/CA%2BTgmoabi-29Vs8H0xkjtYB%3DcU%2BGVCrNwPz7okpa3KsoLmdEUQ%40mail.gmail.com

It would've been nice if we'd gotten the "reorganizing partitioning code"
thread resolved sooner.

> I changed a lot of code also, but cosmetic changes only.
> 
> I'll clean this up a bit more now, and try to commit shortly (or early
> tomorrow); wanted to share current status now in case I have to rush
> out.

Some comments on the code reorganizing part of the patch:

* Did you intentionally not put PartitionBoundInfoData and its accessor
macros in partition_internal.h.  partprune.c would not need to include
partition.h if we do that.

* Also, I wonder why you left PartitionPruneContext in partition.h.  Isn't
it better taken out to partprune.h?

I have done that in the attached.

* Why isn't gen_partprune_steps() in partprune.h?  I see only
prune_append_rel_partitions() exported out of partprune.c, but the runtime
patch needs gen_partprune_steps() to be called from createplan.c.

* I don't see get_matching_partitions() exported either.  Runtime pruning
patch needs that too.

Maybe you've thought something about these two items though.

Thanks,
Amit

Вложения

fastprune-delta-partition-struct-movement.patch

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

06 апреля 2018 г., 08:03:42

On 2018/04/06 8:33, Alvaro Herrera wrote:
>> @@ -1717,8 +1691,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
>>       * parentrte already has the root partrel's updatedCols translated to match
>>       * the attribute ordering of parentrel.
>>       */
>> -    if (!*part_cols_updated)
>> -        *part_cols_updated =
>> +    if (!root->partColsUpdated)
>> +        root->partColsUpdated =
>>              has_partition_attrs(parentrel, parentrte->updatedCols, NULL);
> 
> Hmm, surely this should be |= to avoid resetting a value set in a
> previous call to this function?

It won't be, no?  We set it only if it hasn't been already.  Note that
there is one PlannerInfo per sub-query, so we determine this information
independently for each sub-query.

> In the previous coding it wasn't
> necessary because it was a local variable ...  (though, isn't it a bit
> odd to have this in PlannerInfo?  seems like it should be in
> resultRelInfo, but then you already have it there so I suppose this one
> does *more*)

Hmm, you'd think that we can figure this out in the executor itself, and
hence don't to have this in PlannerInfo or in ModifyTable.  But IIRC,
during the discussion of the update tuple routing patch, it became clear
that it's best do that here, given the way things are now wrt the timing
of partition/inheritance tree expansion.  An update query may modify the
partition key of a table at any arbitrary level and we have to look at all
the tables in the partition tree in this planning phase anyway, so it's
also the best time to see it if the query's modifiedCols overlaps with the
partition key of some table in the tree.  Once we've found that it does
for some table (most likely the root), we're done, that is, we know we got
some "partColsUpdated".

I realized that I had gotten rid of has_default_part from RelOptInfo but
hadn't deleted a comment about it; attached patch to fix that.

Thanks,
Amit

Вложения

fastprune-delta-remove-obsolete-comment.patch

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

06 апреля 2018 г., 18:24:34

Amit Langote wrote:
> Hi.
> 
> On 2018/04/06 7:35, Alvaro Herrera wrote:
> > I seems pretty clear that putting get_matching_partitions() in
> > catalog/partition.c is totally the wrong thing; it belongs wholly in
> > partprune. I think the reason you put it there is that it requires
> > access to a lot of internals that are static in partition.c.  In the
> > attached not yet cleaned version of the patch, I have moved a whole lot
> > of what you added to partition.c to partprune.c; and for the functions
> > and struct declarations that were required to make it work, I created
> > catalog/partition_internal.h.
> 
> Yes, I really wanted for most of the new code that this patch adds to land
> in the planner, especially after Robert's comments here:
> 
> https://www.postgresql.org/message-id/CA%2BTgmoabi-29Vs8H0xkjtYB%3DcU%2BGVCrNwPz7okpa3KsoLmdEUQ%40mail.gmail.com
> 
> It would've been nice if we'd gotten the "reorganizing partitioning code"
> thread resolved sooner.

Grumble.

I don't actually like very much the idea of putting all this code in
optimizer/util.  This morning it occurred to me that we should create a new
src/backend/partitioning/ (and a src/include/partitioning/ to go with
it) and drop a bunch of files there.  Even your proposed new partcache.c
will seem misplaced *anywhere*, since it contains support code to be
used by both planner and executor; in src/{backend,include}/partitioning
it will be able to serve both without it being a modularity wart.

BTW including partition_internal.h in partition.h would defeat the point
of having partition_internal.h in the first place -- at that point you'd
be better off just putting it all in partition.h and save the hassle of
a separate file.  But given the liberty with which catalog/partition.h
has been included everywhere else, IMO that would be pretty disastrous.

I propose to work on reorganizing this code after the commitfest is
over, as part of release stabilization.  I'd rather not have us
supporting a messy system for only five years, if we restructure during
pg12 (which would mean a lot of backpatching pain and pg11-specific
bugs); or worse, forever, if we keep the current proposed layout.

One thing I don't want to do is create a new file that we'll later have
to rename or move, so choosing the best locations is a necessity.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

06 апреля 2018 г., 20:38:23

David Rowley wrote:

> 2. I guess this will be removed before commit.
> 
> +#if 0
> <large section of comments>
> +#endif

Yeah, there is one sentence there I didn't quite understand and would
like to add it to the rewritten version of the comment before I remove
the whole ifdeffed-out comment.

 *    PARTCLAUSE_MATCH_STEPS: *clause_steps set to list of "partition pruning
 *    step(s)" generated for the clause due to it being a BoolExpr or a
 *    ScalarArrayOpExpr that's turned into one

Exactly what does "ScalarArrayOpExpr that's turned into one" means?
Does it mean we turn SAOP into BoolExpr?

(Yes, I know "#if 0" inside a comment doesn't do anything.  It's only
documentation for myself.)

If you look at the rest of the rewritten comment, you'll notice some
things probably need more explaining.  Wording suggestions welcome.

> 3. This comment seems like a strange thing to write just before
> testing if the clause matches the partition key.
> 
> + /* Clause does not match this partition key. */
> + if (equal(leftop, partkey))
> + *rightop = not_clause((Node *) clause)
> + ? (Expr *) makeBoolConst(false, false)
> + : (Expr *) makeBoolConst(true, false);

Yeah.  Looking at this function, I noticed it tests for BooleanTest, and
falls back to checking "not_clause" and a few equals.  Does it make
sense if the clause is a SAOP?  I added this assert:
    Assert(IsA(clause, BooleanTest) ||
           IsA(clause, BoolExpr) ||
           IsA(clause, RelabelType));

and it failed:
#3  0x0000556cf04505db in match_boolean_partition_clause (partopfamily=424, 
    clause=0x556cf1041670, partkey=0x556cf1042218, rightop=0x7ffe520ec068)
    at /pgsql/source/master/src/backend/optimizer/util/partprune.c:2159
2159        Assert(IsA(clause, BooleanTest) ||
(gdb) print *clause
$1 = {type = T_ScalarArrayOpExpr}

I'm not sure whether or not this function can trust that what's incoming
must absolutely be only those node types.

> 4. Comment needs removed.
> 
> + * has_default_part - Whether the table has a default partition

Done.

> The only other thing I noted on this pass is that we could get rid of:
> 
> + /* go check the next clause. */
> + if (unsupported_clause)
> + break;
> 
> and just "continue" instead of "break" in all cases apart from case
> PARTCLAUSE_UNSUPPORTED:
> 
> it would save a few lines and a single condition. What's there works,
> but thought this might be better...

Makes sense -- looking.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

06 апреля 2018 г., 20:40:57

BTW, having both key_is_null and key_is_not_null output args to convey a
single bit of info is a bit lame.  I'm removing it.  We could do the
same with a single boolean, since the return value already indicates
it's a matching IS [NOT] NULL clause; we only need to indicate whether
the NOT is present.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

06 апреля 2018 г., 20:54:39

Alvaro Herrera wrote:

> Yeah.  Looking at this function, I noticed it tests for BooleanTest, and
> falls back to checking "not_clause" and a few equals.  Does it make
> sense if the clause is a SAOP?  I added this assert:
>     Assert(IsA(clause, BooleanTest) ||
>            IsA(clause, BoolExpr) ||
>            IsA(clause, RelabelType));
> 
> and it failed:
> #3  0x0000556cf04505db in match_boolean_partition_clause (partopfamily=424, 
>     clause=0x556cf1041670, partkey=0x556cf1042218, rightop=0x7ffe520ec068)
>     at /pgsql/source/master/src/backend/optimizer/util/partprune.c:2159
> 2159        Assert(IsA(clause, BooleanTest) ||
> (gdb) print *clause
> $1 = {type = T_ScalarArrayOpExpr}
> 
> I'm not sure whether or not this function can trust that what's incoming
> must absolutely be only those node types.

So this is what I need for current regression tests not to crash
anymore:

    Assert(IsA(clause, BooleanTest) ||
           IsA(clause, BoolExpr) ||
           IsA(clause, RelabelType) ||
           IsA(clause, ScalarArrayOpExpr) ||
           IsA(clause, OpExpr) ||
           IsA(clause, Var));

I'm not confident in my ability to write code to handle all possible
cases right now (obviously there must be more cases that are not covered
by current regression tests), so I'll leave it without the assert since
it handles a couple of the useful cases, but I suspect it could stand
some more improvement.

I guess the question is, how interesting is boolean partitioning?  I bet
it has its uses.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

06 апреля 2018 г., 21:36:28

On Fri, Apr 6, 2018 at 11:54 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Alvaro Herrera wrote:
>
>> Yeah.  Looking at this function, I noticed it tests for BooleanTest, and
>> falls back to checking "not_clause" and a few equals.  Does it make
>> sense if the clause is a SAOP?  I added this assert:
>>       Assert(IsA(clause, BooleanTest) ||
>>                  IsA(clause, BoolExpr) ||
>>                  IsA(clause, RelabelType));
>>
>> and it failed:
>> #3  0x0000556cf04505db in match_boolean_partition_clause (partopfamily=424,
>>     clause=0x556cf1041670, partkey=0x556cf1042218, rightop=0x7ffe520ec068)
>>     at /pgsql/source/master/src/backend/optimizer/util/partprune.c:2159
>> 2159          Assert(IsA(clause, BooleanTest) ||
>> (gdb) print *clause
>> $1 = {type = T_ScalarArrayOpExpr}
>>
>> I'm not sure whether or not this function can trust that what's incoming
>> must absolutely be only those node types.
>
> So this is what I need for current regression tests not to crash
> anymore:
>
>         Assert(IsA(clause, BooleanTest) ||
>                    IsA(clause, BoolExpr) ||
>                    IsA(clause, RelabelType) ||
>                    IsA(clause, ScalarArrayOpExpr) ||
>                    IsA(clause, OpExpr) ||
>                    IsA(clause, Var));
>
> I'm not confident in my ability to write code to handle all possible
> cases right now (obviously there must be more cases that are not covered
> by current regression tests), so I'll leave it without the assert since
> it handles a couple of the useful cases, but I suspect it could stand
> some more improvement.
>
> I guess the question is, how interesting is boolean partitioning?  I bet
> it has its uses.

match_boolean_partition_clauses() exists to capture some cases where
an OpExpr (any expression that returns a Boolean for that matter)
itself is the partition key:

create table boolpart (a int) partition by list ((a = 1));
create table boolpart_t partition of boolpart for values in ('true');
create table boolpart_f partition of boolpart for values in ('false');
explain select * from boolpart where a = 1;
                            QUERY PLAN
------------------------------------------------------------------
 Append  (cost=0.00..41.88 rows=13 width=4)
   ->  Seq Scan on boolpart_t  (cost=0.00..41.88 rows=13 width=4)
         Filter: (a = 1)
(3 rows)

explain select * from boolpart where a = 2;
                            QUERY PLAN
------------------------------------------------------------------
 Append  (cost=0.00..41.88 rows=13 width=4)
   ->  Seq Scan on boolpart_f  (cost=0.00..41.88 rows=13 width=4)
         Filter: (a = 2)
(3 rows)

So, it's not that we're only in position to accept certain node types
in match_boolean_partition_clauses().  Before it existed, the pruning
didn't work because it wasn't matched to the partition key in the
special way that match_boolean_partition_clauses() does and end up in
the block in match_clause_to_partition_key() where the OpExpr's are
analyzed for normal (non-Boolean) situations, where we extract either
the leftop or rightop and try to match it with the partition key.

It might as well be:

create table boolpart (a int) partition by list ((a in (1, 2)));

Requiring us to be position to match an ScalarArrayOpExpr with the
partition key.

This resembles match_boolean_index_clause(), by the way.

Thanks,
Amit

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

06 апреля 2018 г., 21:41:06

On Fri, Apr 6, 2018 at 11:38 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Yeah, there is one sentence there I didn't quite understand and would
> like to add it to the rewritten version of the comment before I remove
> the whole ifdeffed-out comment.
>
>  *      PARTCLAUSE_MATCH_STEPS: *clause_steps set to list of "partition pruning
>  *      step(s)" generated for the clause due to it being a BoolExpr or a
>  *      ScalarArrayOpExpr that's turned into one
>
> Exactly what does "ScalarArrayOpExpr that's turned into one" means?
> Does it mean we turn SAOP into BoolExpr?

Yes, we turn a ScalarArrayOpExpr into a BoolExpr and generate prune
step for the latter.  Maybe we'll have a base pruning step that can
process a ScalarArrayOpExpr directly someday.  We create base steps
only for OpExpr's for now.

> If you look at the rest of the rewritten comment, you'll notice some
> things probably need more explaining.  Wording suggestions welcome.

When I looked at it earlier today, I thought your rewrite looked much better.

Thanks,
Amit

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

06 апреля 2018 г., 21:46:26

Amit Langote wrote:

> Some comments on the code reorganizing part of the patch:
> 
> * Did you intentionally not put PartitionBoundInfoData and its accessor
> macros in partition_internal.h.  partprune.c would not need to include
> partition.h if we do that.

Not really.

After pondering this some more, I decided to call the new file
src/include/partition/partbounds.h; and the other new file will become
src/include/partition/partprune.h.  This leads naturally to the idea
that PartitionBoundInfoData will be in partbounds.h.  However, the
typedef struct PartitionBoundInfoData *PartitionBoundInfo will have to
remain in catalog/partition.h, at least for the time being.

> * Also, I wonder why you left PartitionPruneContext in partition.h.  Isn't
> it better taken out to partprune.h?

Yes.

> * Why isn't gen_partprune_steps() in partprune.h?  I see only
> prune_append_rel_partitions() exported out of partprune.c, but the runtime
> patch needs gen_partprune_steps() to be called from createplan.c.
> * I don't see get_matching_partitions() exported either.  Runtime pruning
> patch needs that too.

True -- both exported.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

06 апреля 2018 г., 22:41:39

Here's my proposed patch.

Idle thought: how about renaming the "constfalse" argument and variables
to "contradictory" or maybe just "contradict"?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

partprune-51.patch

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

06 апреля 2018 г., 22:50:15

On Sat, Apr 7, 2018 at 1:41 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Here's my proposed patch.
>
> Idle thought: how about renaming the "constfalse" argument and variables
> to "contradictory" or maybe just "contradict"?

Sounds fine to me.

Thanks,
Amit

Re: [HACKERS] path toward faster partition pruning

От

Jesper Pedersen

Дата:

06 апреля 2018 г., 23:45:03

Hi Alvaro,

On 04/06/2018 12:41 PM, Alvaro Herrera wrote:
> Here's my proposed patch.
> 
> Idle thought: how about renaming the "constfalse" argument and variables
> to "contradictory" or maybe just "contradict"?
> 

Passes check-world.

New directories, and variable rename seems like a good idea; either is ok.

Best regards,
  Jesper

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

07 апреля 2018 г., 02:28:00

So I pushed this 25 minutes ago, and already there's a couple of
buildfarm members complaining:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quokka&dt=2018-04-06%2020%3A09%3A52
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2019%3A55%3A07

Both show exactly the same diff in test partition_prune:

*** /home/pgbuildfarm/buildroot-termite/HEAD/pgsql.build/../pgsql/src/test/regress/expected/partition_prune.out    Fri
Apr 6 15:55:08 2018
 
--- /home/pgbuildfarm/buildroot-termite/HEAD/pgsql.build/src/test/regress/results/partition_prune.out    Fri Apr  6
16:01:402018
 
***************
*** 1348,1357 ****
  ----------+----+-----
   hp0      |    | 
   hp0      |  1 | 
!  hp0      |  1 | xxx
   hp3      | 10 | yyy
!  hp1      |    | xxx
!  hp2      | 10 | xxx
  (6 rows)
  
  -- partial keys won't prune, nor would non-equality conditions
--- 1348,1357 ----
  ----------+----+-----
   hp0      |    | 
   hp0      |  1 | 
!  hp0      | 10 | xxx
!  hp3      |    | xxx
   hp3      | 10 | yyy
!  hp2      |  1 | xxx
  (6 rows)
  
  -- partial keys won't prune, nor would non-equality conditions
***************
*** 1460,1466 ****
                     QUERY PLAN                    
  -------------------------------------------------
   Append
!    ->  Seq Scan on hp0
           Filter: ((a = 1) AND (b = 'xxx'::text))
  (3 rows)
  
--- 1460,1466 ----
                     QUERY PLAN                    
  -------------------------------------------------
   Append
!    ->  Seq Scan on hp2
           Filter: ((a = 1) AND (b = 'xxx'::text))
  (3 rows)
  
***************
*** 1468,1474 ****
                       QUERY PLAN                      
  -----------------------------------------------------
   Append
!    ->  Seq Scan on hp1
           Filter: ((a IS NULL) AND (b = 'xxx'::text))
  (3 rows)
  
--- 1468,1474 ----
                       QUERY PLAN                      
  -----------------------------------------------------
   Append
!    ->  Seq Scan on hp3
           Filter: ((a IS NULL) AND (b = 'xxx'::text))
  (3 rows)
  
***************
*** 1476,1482 ****
                      QUERY PLAN                    
  --------------------------------------------------
   Append
!    ->  Seq Scan on hp2
           Filter: ((a = 10) AND (b = 'xxx'::text))
  (3 rows)
  
--- 1476,1482 ----
                      QUERY PLAN                    
  --------------------------------------------------
   Append
!    ->  Seq Scan on hp0
           Filter: ((a = 10) AND (b = 'xxx'::text))
  (3 rows)
  
***************
*** 1494,1504 ****
   Append
     ->  Seq Scan on hp0
           Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS
NULL)))
-    ->  Seq Scan on hp2
-          Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS
NULL)))
     ->  Seq Scan on hp3
           Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS
NULL)))
! (7 rows)
  
  -- hash partitiong pruning doesn't occur with <> operator clauses
  explain (costs off) select * from hp where a <> 1 and b <> 'xxx';
--- 1494,1502 ----
   Append
     ->  Seq Scan on hp0
           Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS
NULL)))
     ->  Seq Scan on hp3
           Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS
NULL)))
! (5 rows)
  
  -- hash partitiong pruning doesn't occur with <> operator clauses
  explain (costs off) select * from hp where a <> 1 and b <> 'xxx';


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Andres Freund

Дата:

07 апреля 2018 г., 03:03:30

On 2018-04-06 17:28:00 -0300, Alvaro Herrera wrote:
> So I pushed this 25 minutes ago, and already there's a couple of
> buildfarm members complaining:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quokka&dt=2018-04-06%2020%3A09%3A52
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2019%3A55%3A07
> 
> Both show exactly the same diff in test partition_prune:

There's also
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2018-04-06%2020%3A45%3A01
*** /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/contrib/sepgsql/expected/misc.out    2018-02-20
18:45:02.068665297-0800
 
--- /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/contrib/sepgsql/results/misc.out    2018-04-06
13:55:50.718253850-0700
 
***************
*** 32,40 ****
  (6 rows)
  
  SELECT * FROM t1p WHERE o > 50 AND p like '%64%';
- LOG:  SELinux: allowed { execute } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255
tcontext=system_u:object_r:sepgsql_proc_exec_t:s0tclass=db_procedure name="pg_catalog.int4le(integer,integer)"
 
- LOG:  SELinux: allowed { execute } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255
tcontext=system_u:object_r:sepgsql_proc_exec_t:s0tclass=db_procedure name="pg_catalog.int4le(integer,integer)"
 
- LOG:  SELinux: allowed { execute } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255
tcontext=system_u:object_r:sepgsql_proc_exec_t:s0tclass=db_procedure name="pg_catalog.int4le(integer,integer)"
 
  LOG:  SELinux: allowed { select } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255
tcontext=unconfined_u:object_r:sepgsql_table_t:s0tclass=db_table name="public.t1p"
 
  LOG:  SELinux: allowed { select } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255
tcontext=unconfined_u:object_r:sepgsql_table_t:s0tclass=db_column name="table t1p column o"
 
  LOG:  SELinux: allowed { select } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255
tcontext=unconfined_u:object_r:sepgsql_table_t:s0tclass=db_column name="table t1p column p"
 
--- 32,37 ----

seems you just need to remove those rows from the expected file.

- Andres

Re: [HACKERS] path toward faster partition pruning

От

Robert Haas

Дата:

07 апреля 2018 г., 05:01:55

On Fri, Apr 6, 2018 at 8:24 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I don't actually like very much the idea of putting all this code in
> optimizer/util.  This morning it occurred to me that we should create a new
> src/backend/partitioning/ (and a src/include/partitioning/ to go with
> it) and drop a bunch of files there.  Even your proposed new partcache.c
> will seem misplaced *anywhere*, since it contains support code to be
> used by both planner and executor; in src/{backend,include}/partitioning
> it will be able to serve both without it being a modularity wart.

Uh, what?

Surely partcache.c is correctly placed next to relcache.c and
syscache.c and everything else in src/backend/utils/cache.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

07 апреля 2018 г., 05:40:35

Robert Haas wrote:
> On Fri, Apr 6, 2018 at 8:24 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > I don't actually like very much the idea of putting all this code in
> > optimizer/util.  This morning it occurred to me that we should create a new
> > src/backend/partitioning/ (and a src/include/partitioning/ to go with
> > it) and drop a bunch of files there.  Even your proposed new partcache.c
> > will seem misplaced *anywhere*, since it contains support code to be
> > used by both planner and executor; in src/{backend,include}/partitioning
> > it will be able to serve both without it being a modularity wart.
> 
> Uh, what?
> 
> Surely partcache.c is correctly placed next to relcache.c and
> syscache.c and everything else in src/backend/utils/cache.

Frankly, I'm not real sure about partcache.c yet.  Are you?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

07 апреля 2018 г., 06:35:41

Thank you Alvaro for rest of the cleanup and committing.

On Sat, Apr 7, 2018 at 5:28 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> So I pushed this 25 minutes ago, and already there's a couple of
> buildfarm members complaining:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quokka&dt=2018-04-06%2020%3A09%3A52
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2019%3A55%3A07
>
> Both show exactly the same diff in test partition_prune:
>
> *** /home/pgbuildfarm/buildroot-termite/HEAD/pgsql.build/../pgsql/src/test/regress/expected/partition_prune.out Fri
Apr 6 15:55:08 2018
 
> --- /home/pgbuildfarm/buildroot-termite/HEAD/pgsql.build/src/test/regress/results/partition_prune.out   Fri Apr  6
16:01:402018
 
> ***************
> *** 1348,1357 ****
>   ----------+----+-----
>    hp0      |    |
>    hp0      |  1 |
> !  hp0      |  1 | xxx
>    hp3      | 10 | yyy
> !  hp1      |    | xxx
> !  hp2      | 10 | xxx
>   (6 rows)
>
>   -- partial keys won't prune, nor would non-equality conditions
> --- 1348,1357 ----
>   ----------+----+-----
>    hp0      |    |
>    hp0      |  1 |
> !  hp0      | 10 | xxx
> !  hp3      |    | xxx
>    hp3      | 10 | yyy
> !  hp2      |  1 | xxx
>   (6 rows)
>
>   -- partial keys won't prune, nor would non-equality conditions
> ***************
> *** 1460,1466 ****
>                      QUERY PLAN
>   -------------------------------------------------
>    Append
> !    ->  Seq Scan on hp0
>            Filter: ((a = 1) AND (b = 'xxx'::text))
>   (3 rows)
>
> --- 1460,1466 ----
>                      QUERY PLAN
>   -------------------------------------------------
>    Append
> !    ->  Seq Scan on hp2
>            Filter: ((a = 1) AND (b = 'xxx'::text))
>   (3 rows)
>
> ***************
> *** 1468,1474 ****
>                        QUERY PLAN
>   -----------------------------------------------------
>    Append
> !    ->  Seq Scan on hp1
>            Filter: ((a IS NULL) AND (b = 'xxx'::text))
>   (3 rows)
>
> --- 1468,1474 ----
>                        QUERY PLAN
>   -----------------------------------------------------
>    Append
> !    ->  Seq Scan on hp3
>            Filter: ((a IS NULL) AND (b = 'xxx'::text))
>   (3 rows)
>
> ***************
> *** 1476,1482 ****
>                       QUERY PLAN
>   --------------------------------------------------
>    Append
> !    ->  Seq Scan on hp2
>            Filter: ((a = 10) AND (b = 'xxx'::text))
>   (3 rows)
>
> --- 1476,1482 ----
>                       QUERY PLAN
>   --------------------------------------------------
>    Append
> !    ->  Seq Scan on hp0
>            Filter: ((a = 10) AND (b = 'xxx'::text))
>   (3 rows)
>
> ***************
> *** 1494,1504 ****
>    Append
>      ->  Seq Scan on hp0
>            Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS
NULL)))
> -    ->  Seq Scan on hp2
> -          Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS
NULL)))
>      ->  Seq Scan on hp3
>            Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS
NULL)))
> ! (7 rows)
>
>   -- hash partitiong pruning doesn't occur with <> operator clauses
>   explain (costs off) select * from hp where a <> 1 and b <> 'xxx';
> --- 1494,1502 ----
>    Append
>      ->  Seq Scan on hp0
>            Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS
NULL)))
>      ->  Seq Scan on hp3
>            Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS
NULL)))
> ! (5 rows)

So this same failure occurs on (noting the architecture):

ppc64:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quokka&dt=2018-04-06%2020%3A09%3A52

ia64:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2018-04-06%2022%3A32%3A24

ppc64 (POWER7):
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2018-04-06%2022%3A58%3A13

ppc64 (POWER7):
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2018-04-06%2023%3A02%3A13

powerpc:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-04-06%2023%3A05%3A08

powerpc:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-04-06%2023%3A13%3A23

powerpc 32-bit userspace on ppc64 host:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2023%3A40%3A07

Seems to be due to that the hashing function used in partitioning
gives different answer for a given set of partition key values than
others.

Thanks,
Amit

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

07 апреля 2018 г., 06:43:29

On 7 April 2018 at 12:35, Amit Langote <amitlangote09@gmail.com> wrote:
> Thank you Alvaro for rest of the cleanup and committing.

+10!

> So this same failure occurs on (noting the architecture):
>
> ppc64:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quokka&dt=2018-04-06%2020%3A09%3A52
>
> ia64:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2018-04-06%2022%3A32%3A24
>
> ppc64 (POWER7):
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2018-04-06%2022%3A58%3A13
>
> ppc64 (POWER7):
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2018-04-06%2023%3A02%3A13
>
> powerpc:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-04-06%2023%3A05%3A08
>
> powerpc:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-04-06%2023%3A13%3A23
>
> powerpc 32-bit userspace on ppc64 host:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2023%3A40%3A07
>
> Seems to be due to that the hashing function used in partitioning
> gives different answer for a given set of partition key values than
> others.

They all look like bigendian CPUs.

https://en.wikipedia.org/wiki/Comparison_of_instruction_set_architectures#Endianness

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

07 апреля 2018 г., 07:31:51

On 7 April 2018 at 12:43, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 7 April 2018 at 12:35, Amit Langote <amitlangote09@gmail.com> wrote:
>> So this same failure occurs on (noting the architecture):
>>
>> Seems to be due to that the hashing function used in partitioning
>> gives different answer for a given set of partition key values than
>> others.
>
> They all look like bigendian CPUs.

I looked at all the regression test diffs for each of the servers you
mentioned and I verified that the diffs match on each of the 7
servers.

Maybe the best solution is to pull those tests out of
partition_prune.sql then create partition_prune_hash and just have an
alternative .out file with the partitions which match on bigendian
machines.

We could also keep them in the same file, but that's a much bigger
alternative file to maintain and more likely to get broken if someone
forgets to update it.

What do you think?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

07 апреля 2018 г., 07:50:10

On Sat, Apr 7, 2018 at 10:31 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 7 April 2018 at 12:43, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> On 7 April 2018 at 12:35, Amit Langote <amitlangote09@gmail.com> wrote:
>>> So this same failure occurs on (noting the architecture):
>>>
>>> Seems to be due to that the hashing function used in partitioning
>>> gives different answer for a given set of partition key values than
>>> others.
>>
>> They all look like bigendian CPUs.
>
> I looked at all the regression test diffs for each of the servers you
> mentioned and I verified that the diffs match on each of the 7
> servers.
>
> Maybe the best solution is to pull those tests out of
> partition_prune.sql then create partition_prune_hash and just have an
> alternative .out file with the partitions which match on bigendian
> machines.
>
> We could also keep them in the same file, but that's a much bigger
> alternative file to maintain and more likely to get broken if someone
> forgets to update it.
>
> What do you think?

Yeah, that's an idea.

Is it alright though that same data may end up in different hash
partitions depending on the architecture?  IIRC, that's the way we
decided to go when using hash partitioning, but it would've been
clearer if there was already some evidence in regression tests that
that's what we've chosen, such as, some existing tests for tuple
routing.

Thanks,
Amit

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

07 апреля 2018 г., 07:55:37

On 7 April 2018 at 13:50, Amit Langote <amitlangote09@gmail.com> wrote:
> On Sat, Apr 7, 2018 at 10:31 AM, David Rowley
>> I looked at all the regression test diffs for each of the servers you
>> mentioned and I verified that the diffs match on each of the 7
>> servers.
>>
>> Maybe the best solution is to pull those tests out of
>> partition_prune.sql then create partition_prune_hash and just have an
>> alternative .out file with the partitions which match on bigendian
>> machines.
>>
>> We could also keep them in the same file, but that's a much bigger
>> alternative file to maintain and more likely to get broken if someone
>> forgets to update it.
>>
>> What do you think?
>
> Yeah, that's an idea.
>
> Is it alright though that same data may end up in different hash
> partitions depending on the architecture?  IIRC, that's the way we
> decided to go when using hash partitioning, but it would've been
> clearer if there was already some evidence in regression tests that
> that's what we've chosen, such as, some existing tests for tuple
> routing.

The only alternative would be to change all the hash functions so that
they normalise their endianness. It does not sound like something that
will perform very well. Plus it would break everyone's hash indexes on
a pg_upgrade.

pg_basebackups can't be transferred over to other architectures
anyway, so I'm not so worried about tuples being routed to other
partitions.

Maybe someone else can see a reason why this is bad?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

07 апреля 2018 г., 08:42:53

On 7 April 2018 at 13:31, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Maybe the best solution is to pull those tests out of
> partition_prune.sql then create partition_prune_hash and just have an
> alternative .out file with the partitions which match on bigendian
> machines.

Here's 1 of 2. I thought it was best to get the buildfarm green again
as soon as possible.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

0001-Remove-HASH-partition-pruning-tests.patch

Re: [HACKERS] path toward faster partition pruning

От

Andres Freund

Дата:

07 апреля 2018 г., 09:00:10

On 2018-04-07 14:42:53 +1200, David Rowley wrote:
> On 7 April 2018 at 13:31, David Rowley <david.rowley@2ndquadrant.com> wrote:
> > Maybe the best solution is to pull those tests out of
> > partition_prune.sql then create partition_prune_hash and just have an
> > alternative .out file with the partitions which match on bigendian
> > machines.
> 
> Here's 1 of 2. I thought it was best to get the buildfarm green again
> as soon as possible.

Do you have an estimate how long it'll take you to produce patch 2? It'd
be cool to get this covered again soon.  If you don't have access to a
big endian machine, we can construct the output from the buildfarm... So
pulling the tests out would be the only "urgent" thing, I can go on from
there.

- Andres

Re: [HACKERS] path toward faster partition pruning

От

Ashutosh Bapat

Дата:

07 апреля 2018 г., 09:03:06

On Sat, Apr 7, 2018 at 7:25 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 7 April 2018 at 13:50, Amit Langote <amitlangote09@gmail.com> wrote:
>> On Sat, Apr 7, 2018 at 10:31 AM, David Rowley
>>> I looked at all the regression test diffs for each of the servers you
>>> mentioned and I verified that the diffs match on each of the 7
>>> servers.
>>>
>>> Maybe the best solution is to pull those tests out of
>>> partition_prune.sql then create partition_prune_hash and just have an
>>> alternative .out file with the partitions which match on bigendian
>>> machines.
>>>
>>> We could also keep them in the same file, but that's a much bigger
>>> alternative file to maintain and more likely to get broken if someone
>>> forgets to update it.
>>>
>>> What do you think?
>>
>> Yeah, that's an idea.
>>
>> Is it alright though that same data may end up in different hash
>> partitions depending on the architecture?  IIRC, that's the way we
>> decided to go when using hash partitioning, but it would've been
>> clearer if there was already some evidence in regression tests that
>> that's what we've chosen, such as, some existing tests for tuple
>> routing.
>
> The only alternative would be to change all the hash functions so that
> they normalise their endianness. It does not sound like something that
> will perform very well. Plus it would break everyone's hash indexes on
> a pg_upgrade.
>
> pg_basebackups can't be transferred over to other architectures
> anyway, so I'm not so worried about tuples being routed to other
> partitions.
>
> Maybe someone else can see a reason why this is bad?

I don't think the concept is bad by itself. That's expected, in fact,
we have added an option to pg_dump (dump through parent or some such)
to handle exactly this case. What Amit seems to be complaining though
is the regression test. We need to write regression tests so that they
produce the same plans, pruning same partitions by name, on all
architectures.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

07 апреля 2018 г., 09:04:37

On 7 April 2018 at 15:00, Andres Freund <andres@anarazel.de> wrote:
> On 2018-04-07 14:42:53 +1200, David Rowley wrote:
>> On 7 April 2018 at 13:31, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> > Maybe the best solution is to pull those tests out of
>> > partition_prune.sql then create partition_prune_hash and just have an
>> > alternative .out file with the partitions which match on bigendian
>> > machines.
>>
>> Here's 1 of 2. I thought it was best to get the buildfarm green again
>> as soon as possible.
>
> Do you have an estimate how long it'll take you to produce patch 2? It'd
> be cool to get this covered again soon.  If you don't have access to a
> big endian machine, we can construct the output from the buildfarm... So
> pulling the tests out would be the only "urgent" thing, I can go on from
> there.

Attached.

I've not tested on a bigendian machine, but the diff -c between the
two output files match the diff on the failing buildfarm members.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

0002-Add-HASH-partition-pruning-tests.patch

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

07 апреля 2018 г., 09:07:28

On 7 April 2018 at 15:03, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Sat, Apr 7, 2018 at 7:25 AM, David Rowley
>> The only alternative would be to change all the hash functions so that
>> they normalise their endianness. It does not sound like something that
>> will perform very well. Plus it would break everyone's hash indexes on
>> a pg_upgrade.
>>
>> pg_basebackups can't be transferred over to other architectures
>> anyway, so I'm not so worried about tuples being routed to other
>> partitions.
>>
>> Maybe someone else can see a reason why this is bad?
>
> I don't think the concept is bad by itself. That's expected, in fact,
> we have added an option to pg_dump (dump through parent or some such)
> to handle exactly this case. What Amit seems to be complaining though
> is the regression test. We need to write regression tests so that they
> produce the same plans, pruning same partitions by name, on all
> architectures.

Why is writing tests that produce the same output required?

We have many tests with alternative outputs. Look in
src/tests/regress/expected for files matching _1.out

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Ashutosh Bapat

Дата:

07 апреля 2018 г., 09:14:45

On Sat, Apr 7, 2018 at 8:37 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 7 April 2018 at 15:03, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> On Sat, Apr 7, 2018 at 7:25 AM, David Rowley
>>> The only alternative would be to change all the hash functions so that
>>> they normalise their endianness. It does not sound like something that
>>> will perform very well. Plus it would break everyone's hash indexes on
>>> a pg_upgrade.
>>>
>>> pg_basebackups can't be transferred over to other architectures
>>> anyway, so I'm not so worried about tuples being routed to other
>>> partitions.
>>>
>>> Maybe someone else can see a reason why this is bad?
>>
>> I don't think the concept is bad by itself. That's expected, in fact,
>> we have added an option to pg_dump (dump through parent or some such)
>> to handle exactly this case. What Amit seems to be complaining though
>> is the regression test. We need to write regression tests so that they
>> produce the same plans, pruning same partitions by name, on all
>> architectures.
>
> Why is writing tests that produce the same output required?
>
> We have many tests with alternative outputs. Look in
> src/tests/regress/expected for files matching _1.out
>

That's true, but we usually add such alternative output when we know
all the variants possible as long as "all the variants" do not cover
everything possible. AFAIU, that's not true here. Also, on a given
machine a particular row is guaranteed to fall in a given partition.
On a different machine it will fall in some other partition, but
always that partition on that machine. We don't have a way to select
alternate output based on the architecture. May be a better idea is to
use .source file, creating .out on the fly based on the architecture
of machine like testing the hash output for a given value to decide
which partition it will fall into and then crafting .out with that
partition's name.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] path toward faster partition pruning

От

Andres Freund

Дата:

07 апреля 2018 г., 09:18:41

On 2018-04-07 15:04:37 +1200, David Rowley wrote:
> On 7 April 2018 at 15:00, Andres Freund <andres@anarazel.de> wrote:
> > On 2018-04-07 14:42:53 +1200, David Rowley wrote:
> >> On 7 April 2018 at 13:31, David Rowley <david.rowley@2ndquadrant.com> wrote:
> >> > Maybe the best solution is to pull those tests out of
> >> > partition_prune.sql then create partition_prune_hash and just have an
> >> > alternative .out file with the partitions which match on bigendian
> >> > machines.
> >>
> >> Here's 1 of 2. I thought it was best to get the buildfarm green again
> >> as soon as possible.
> >
> > Do you have an estimate how long it'll take you to produce patch 2? It'd
> > be cool to get this covered again soon.  If you don't have access to a
> > big endian machine, we can construct the output from the buildfarm... So
> > pulling the tests out would be the only "urgent" thing, I can go on from
> > there.
> 
> Attached.
> 
> I've not tested on a bigendian machine, but the diff -c between the
> two output files match the diff on the failing buildfarm members.

I've pushed the two patches (collapsed). Trying to get the BF green-ish
again...

- Andres

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

07 апреля 2018 г., 09:20:41

On 7 April 2018 at 15:18, Andres Freund <andres@anarazel.de> wrote:
> I've pushed the two patches (collapsed). Trying to get the BF green-ish
> again...

Thanks!


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

07 апреля 2018 г., 09:25:27

On 7 April 2018 at 15:14, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Sat, Apr 7, 2018 at 8:37 AM, David Rowley
>> Why is writing tests that produce the same output required?
>>
>> We have many tests with alternative outputs. Look in
>> src/tests/regress/expected for files matching _1.out
>>
>
> That's true, but we usually add such alternative output when we know
> all the variants possible as long as "all the variants" do not cover
> everything possible. AFAIU, that's not true here. Also, on a given
> machine a particular row is guaranteed to fall in a given partition.
> On a different machine it will fall in some other partition, but
> always that partition on that machine. We don't have a way to select
> alternate output based on the architecture. May be a better idea is to
> use .source file, creating .out on the fly based on the architecture
> of machine like testing the hash output for a given value to decide
> which partition it will fall into and then crafting .out with that
> partition's name.

Sounds like you're saying that if we have too many alternative files
then there's a chance that one could pass by luck.

Maybe we can just back up what's just been committed by actually
executing the queries and ensuring that all rows that made it into the
table make it back out again.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Ashutosh Bapat

Дата:

07 апреля 2018 г., 09:39:23

On Sat, Apr 7, 2018 at 8:55 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 7 April 2018 at 15:14, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> On Sat, Apr 7, 2018 at 8:37 AM, David Rowley
>>> Why is writing tests that produce the same output required?
>>>
>>> We have many tests with alternative outputs. Look in
>>> src/tests/regress/expected for files matching _1.out
>>>
>>
>> That's true, but we usually add such alternative output when we know
>> all the variants possible as long as "all the variants" do not cover
>> everything possible. AFAIU, that's not true here. Also, on a given
>> machine a particular row is guaranteed to fall in a given partition.
>> On a different machine it will fall in some other partition, but
>> always that partition on that machine. We don't have a way to select
>> alternate output based on the architecture. May be a better idea is to
>> use .source file, creating .out on the fly based on the architecture
>> of machine like testing the hash output for a given value to decide
>> which partition it will fall into and then crafting .out with that
>> partition's name.
>
> Sounds like you're saying that if we have too many alternative files
> then there's a chance that one could pass by luck.

Yes.

>
> Maybe we can just back up what's just been committed by actually
> executing the queries and ensuring that all rows that made it into the
> table make it back out again.

That's one way. But how would we make sure that they landed in proper
partition. Actually we do not know what's proper partition for a given
architecture. And how would we make sure that all rows with the same
partition key land in the same partition. That's why I am suggesting
to calculate the hash value, take modulo and craft the name of
partition where corresponding row will land on a given architecture.
That way, we are sure that the tuple routing logic is correct and also
the partition pruning logic.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] path toward faster partition pruning

От

Tom Lane

Дата:

07 апреля 2018 г., 09:41:22

David Rowley <david.rowley@2ndquadrant.com> writes:
> Sounds like you're saying that if we have too many alternative files
> then there's a chance that one could pass by luck.

Yeah, exactly: it passed, but did it pass for the right reason?

If there's just two expected-files, it's likely not a big problem,
but if you have a bunch it's something to worry about.

I'm also wondering how come we had hash partitioning before and
did not have this sort of problem.  Is it just that we added a
new test that's more sensitive to the details of the hashing
(if so, could it be made less so)?  Or is there actually more
platform dependence now than before (and if so, why is that)?

            regards, tom lane

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

07 апреля 2018 г., 09:42:21

On 7 April 2018 at 15:41, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm also wondering how come we had hash partitioning before and
> did not have this sort of problem.  Is it just that we added a
> new test that's more sensitive to the details of the hashing
> (if so, could it be made less so)?  Or is there actually more
> platform dependence now than before (and if so, why is that)?

We didn't prune HASH partitions before today. They were just all returned.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Andres Freund

Дата:

07 апреля 2018 г., 09:46:22

Hi,

On 2018-04-06 23:41:22 -0400, Tom Lane wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
> > Sounds like you're saying that if we have too many alternative files
> > then there's a chance that one could pass by luck.
> 
> Yeah, exactly: it passed, but did it pass for the right reason?
> 
> If there's just two expected-files, it's likely not a big problem,
> but if you have a bunch it's something to worry about.

There should be only two alternatives, given our current hashing
implementation, right?

Greetings,

Andres Freund

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

07 апреля 2018 г., 09:49:54

On 7 April 2018 at 15:41, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> Sounds like you're saying that if we have too many alternative files
>> then there's a chance that one could pass by luck.
>
> Yeah, exactly: it passed, but did it pass for the right reason?
>
> If there's just two expected-files, it's likely not a big problem,
> but if you have a bunch it's something to worry about.

Right, I suggest we wait and see if all members go green again as a
result of 40e42e1024c, and if they're happy then we could maybe leave
it as is with the 2 alternatives output files.

If there are some other variations that crop up, then we can think
harder about what we can do to improve the coverage.

I don't particularly think it matters which hash partition a tuple
goes into, as long as the hash function spreads the values out enough
and most importantly, the pruning code looks for the tuple in the
partition that it was actually inserted into in the first place.
Obviously, we also want to ensure we never do anything which would
change the matching partition in either minor or major version
upgrades too.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Andres Freund

Дата:

07 апреля 2018 г., 10:09:23

Hi,

On 2018-04-07 15:49:54 +1200, David Rowley wrote:
> Right, I suggest we wait and see if all members go green again as a
> result of 40e42e1024c, and if they're happy then we could maybe leave
> it as is with the 2 alternatives output files.

At least the first previously borked animal came back green (termite).


> I don't particularly think it matters which hash partition a tuple
> goes into, as long as the hash function spreads the values out enough
> and most importantly, the pruning code looks for the tuple in the
> partition that it was actually inserted into in the first place.
> Obviously, we also want to ensure we never do anything which would
> change the matching partition in either minor or major version
> upgrades too.

+1


I've also attempted to fix rhinoceros's failure I remarked upon a couple
hours ago in
https://postgr.es/m/20180406210330.wmqw42wqgiicktli@alap3.anarazel.de

Greetings,

Andres Freund

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

07 апреля 2018 г., 10:11:03

On 7 April 2018 at 09:03, Andres Freund <andres@anarazel.de> wrote:
> There's also
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2018-04-06%2020%3A45%3A01
> *** /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/contrib/sepgsql/expected/misc.out       2018-02-20
18:45:02.068665297-0800
 
> --- /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/contrib/sepgsql/results/misc.out        2018-04-06
13:55:50.718253850-0700
 
> ***************
> *** 32,40 ****
>   (6 rows)
>
>   SELECT * FROM t1p WHERE o > 50 AND p like '%64%';
> - LOG:  SELinux: allowed { execute } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255
tcontext=system_u:object_r:sepgsql_proc_exec_t:s0tclass=db_procedure name="pg_catalog.int4le(integer,integer)"
 
> - LOG:  SELinux: allowed { execute } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255
tcontext=system_u:object_r:sepgsql_proc_exec_t:s0tclass=db_procedure name="pg_catalog.int4le(integer,integer)"
 
> - LOG:  SELinux: allowed { execute } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255
tcontext=system_u:object_r:sepgsql_proc_exec_t:s0tclass=db_procedure name="pg_catalog.int4le(integer,integer)"
 
>   LOG:  SELinux: allowed { select } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255
tcontext=unconfined_u:object_r:sepgsql_table_t:s0tclass=db_table name="public.t1p"
 
>   LOG:  SELinux: allowed { select } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255
tcontext=unconfined_u:object_r:sepgsql_table_t:s0tclass=db_column name="table t1p column o"
 
>   LOG:  SELinux: allowed { select } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255
tcontext=unconfined_u:object_r:sepgsql_table_t:s0tclass=db_column name="table t1p column p"
 
> --- 32,37 ----
>
> seems you just need to remove those rows from the expected file.

Agreed.

Patch attached.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

part_prune_sepgsql_fix.patch

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

07 апреля 2018 г., 10:13:13

On 7 April 2018 at 16:09, Andres Freund <andres@anarazel.de> wrote:
> I've also attempted to fix rhinoceros's failure I remarked upon a couple
> hours ago in
> https://postgr.es/m/20180406210330.wmqw42wqgiicktli@alap3.anarazel.de

Oh, thanks!

I had just been looking at that too...

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

07 апреля 2018 г., 10:14:52

On 7 April 2018 at 15:18, Andres Freund <andres@anarazel.de> wrote:
> I've pushed the two patches (collapsed). Trying to get the BF green-ish
> again...

termite has now gone green.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

07 апреля 2018 г., 10:18:08

On Sat, Apr 7, 2018 at 1:09 PM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2018-04-07 15:49:54 +1200, David Rowley wrote:
>> Right, I suggest we wait and see if all members go green again as a
>> result of 40e42e1024c, and if they're happy then we could maybe leave
>> it as is with the 2 alternatives output files.
>
> At least the first previously borked animal came back green (termite).
>
>
>> I don't particularly think it matters which hash partition a tuple
>> goes into, as long as the hash function spreads the values out enough
>> and most importantly, the pruning code looks for the tuple in the
>> partition that it was actually inserted into in the first place.
>> Obviously, we also want to ensure we never do anything which would
>> change the matching partition in either minor or major version
>> upgrades too.
>
> +1

+1

Given that the difference only appeared on animals that David pointed
out have big-endian architecture, it seems we'd only need two output
files.  It does seem true that the extended hashing functions  that
were adding to support partitioning would somehow be affected by
endianness.

Thank you David for creating the patches and Andres for committing it.
Buildfarm seems to be turning green where it had gone red due to the
hashing differences.

> I've also attempted to fix rhinoceros's failure I remarked upon a couple
> hours ago in
> https://postgr.es/m/20180406210330.wmqw42wqgiicktli@alap3.anarazel.de

Thanks Andres.

Regards,
Amit

Re: [HACKERS] path toward faster partition pruning

От

Tom Lane

Дата:

07 апреля 2018 г., 10:39:24

Amit Langote <amitlangote09@gmail.com> writes:
> Given that the difference only appeared on animals that David pointed
> out have big-endian architecture, it seems we'd only need two output
> files.

Dunno, I'm wondering whether 32 vs 64 bit will make a difference.

            regards, tom lane

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

07 апреля 2018 г., 12:41:35

On Sat, Apr 7, 2018 at 1:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <amitlangote09@gmail.com> writes:
>> Given that the difference only appeared on animals that David pointed
>> out have big-endian architecture, it seems we'd only need two output
>> files.
>
> Dunno, I'm wondering whether 32 vs 64 bit will make a difference.

There was one 32-bit animal in the failing set, which apparently
produces the same hashes as others (allegedly due to endianness
difference).

powerpc 32-bit userspace on ppc64 host:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2023%3A40%3A07

...and it has turned green since the alternative outputs fix went in.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-07%2004%3A06%3A09

Thanks,
Amit

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

07 апреля 2018 г., 17:13:23

Andres Freund wrote:
> Hi,
> 
> On 2018-04-07 15:49:54 +1200, David Rowley wrote:
> > Right, I suggest we wait and see if all members go green again as a
> > result of 40e42e1024c, and if they're happy then we could maybe leave
> > it as is with the 2 alternatives output files.
> 
> At least the first previously borked animal came back green (termite).

Thanks everyone for addressing this.

> I've also attempted to fix rhinoceros's failure I remarked upon a couple
> hours ago in
> https://postgr.es/m/20180406210330.wmqw42wqgiicktli@alap3.anarazel.de

And this, too.  I was unsure if this was because we were missing calling
some object access hook from the new code, or the additional pruning.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Andres Freund

Дата:

07 апреля 2018 г., 21:48:45

On 2018-04-07 08:13:23 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > I've also attempted to fix rhinoceros's failure I remarked upon a couple
> > hours ago in
> > https://postgr.es/m/20180406210330.wmqw42wqgiicktli@alap3.anarazel.de
> 
> And this, too.  I was unsure if this was because we were missing calling
> some object access hook from the new code, or the additional pruning.

That's possible.  I did attempt to skim the code, that's where my
complain about the docs originated. There certainly isn't an
InvokeFunctionExecuteHook() present.  It's not clear to me whether
that's an issue - we don't invoke the hooks in a significant number of
places either, and as far as I can discern there's not much rule or
reason about where we invoke it.

Greetings,

Andres Freund

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

08 апреля 2018 г., 02:00:34

Andres Freund wrote:
> On 2018-04-07 08:13:23 -0300, Alvaro Herrera wrote:
> > Andres Freund wrote:
> > > I've also attempted to fix rhinoceros's failure I remarked upon a couple
> > > hours ago in
> > > https://postgr.es/m/20180406210330.wmqw42wqgiicktli@alap3.anarazel.de
> > 
> > And this, too.  I was unsure if this was because we were missing calling
> > some object access hook from the new code, or the additional pruning.
> 
> That's possible.  I did attempt to skim the code, that's where my
> complain about the docs originated. There certainly isn't an
> InvokeFunctionExecuteHook() present.  It's not clear to me whether
> that's an issue - we don't invoke the hooks in a significant number of
> places either, and as far as I can discern there's not much rule or
> reason about where we invoke it.

I managed to convince myself that it's not higher-level code's
responsibility to invoke the execute hooks; the likelihood of bugs of
omission seems just too large.  But maybe I'm wrong.

There's a small number of InvokeFunctionExecuteHook calls in the
executor, but I really doubt that it exhaustively covers everyplace
where catalogued functions are called in the executor.

CC'ing KaiGai and Stephen Frost; they may want to chip in here.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

09 апреля 2018 г., 09:48:57

While looking at the docs in [1], I saw that we still mention:

4. Ensure that the constraint_exclusion configuration parameter is not
disabled in postgresql.conf. If it is, queries will not be optimized
as desired.

This is no longer true. The attached patch removed it.

[1] https://www.postgresql.org/docs/10/static/ddl-partitioning.html

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

0001-Remove-mention-of-constraint_exclusion-in-partitioni.patch

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

09 апреля 2018 г., 15:27:32

Hi David.

On 2018/04/09 12:48, David Rowley wrote:
> While looking at the docs in [1], I saw that we still mention:
> 
> 4. Ensure that the constraint_exclusion configuration parameter is not
> disabled in postgresql.conf. If it is, queries will not be optimized
> as desired.
> 
> This is no longer true. The attached patch removed it.
> 
> [1] https://www.postgresql.org/docs/10/static/ddl-partitioning.htm
Thanks.  I was aware of the changes that would need to be made, but you
beat me to writing the patch itself.

About the patch:

While the users no longer need to enable constraint_exclusion true for
select queries, one would still need it for update/delete queries, because
the new pruning logic only gets invoked for the former.  Alas...

Also, further down on that page, there is a 5.10.4 Partitioning and
Constraint Exclusion sub-section.  I think it would also need some tweaks
due to new developments.

I updated your patch to fix that.  Please take a look.

Thanks,
Amit

Вложения

v2-0001-Remove-mention-of-constraint_exclusion-in-partiti.patch

Re: [HACKERS] path toward faster partition pruning

От

Robert Haas

Дата:

09 апреля 2018 г., 21:26:04

On Fri, Apr 6, 2018 at 11:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> Sounds like you're saying that if we have too many alternative files
>> then there's a chance that one could pass by luck.
>
> Yeah, exactly: it passed, but did it pass for the right reason?
>
> If there's just two expected-files, it's likely not a big problem,
> but if you have a bunch it's something to worry about.
>
> I'm also wondering how come we had hash partitioning before and
> did not have this sort of problem.  Is it just that we added a
> new test that's more sensitive to the details of the hashing
> (if so, could it be made less so)?  Or is there actually more
> platform dependence now than before (and if so, why is that)?

The existing hash partitioning tests did have some dependencies on the
hash function, but they took care not to use the built-in hash
functions.  Instead they did stuff like this:

CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS
$$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE;
CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS
OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8);
CREATE TABLE mchash (a int, b text, c jsonb)
  PARTITION BY HASH (a test_int4_ops, b test_text_ops);

I think that this approach should also be used for the new tests.
Variant expected output files are a pain to maintain, and you
basically just have to take whatever output you get as the right
answer, because nobody knows what output a certain built-in hash
function should produce for a given input except by running the code.
If you do the kind of thing shown above, though, then you can easily
see by inspection that you're getting the right answer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] path toward faster partition pruning

От

Ashutosh Bapat

Дата:

10 апреля 2018 г., 10:27:36

On Mon, Apr 9, 2018 at 8:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 6, 2018 at 11:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> David Rowley <david.rowley@2ndquadrant.com> writes:
>>> Sounds like you're saying that if we have too many alternative files
>>> then there's a chance that one could pass by luck.
>>
>> Yeah, exactly: it passed, but did it pass for the right reason?
>>
>> If there's just two expected-files, it's likely not a big problem,
>> but if you have a bunch it's something to worry about.
>>
>> I'm also wondering how come we had hash partitioning before and
>> did not have this sort of problem.  Is it just that we added a
>> new test that's more sensitive to the details of the hashing
>> (if so, could it be made less so)?  Or is there actually more
>> platform dependence now than before (and if so, why is that)?
>
> The existing hash partitioning tests did have some dependencies on the
> hash function, but they took care not to use the built-in hash
> functions.  Instead they did stuff like this:
>
> CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS
> $$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE;
> CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS
> OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8);
> CREATE TABLE mchash (a int, b text, c jsonb)
>   PARTITION BY HASH (a test_int4_ops, b test_text_ops);
>
> I think that this approach should also be used for the new tests.
> Variant expected output files are a pain to maintain, and you
> basically just have to take whatever output you get as the right
> answer, because nobody knows what output a certain built-in hash
> function should produce for a given input except by running the code.
> If you do the kind of thing shown above, though, then you can easily
> see by inspection that you're getting the right answer.

+1.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

10 апреля 2018 г., 14:56:19

On 2018/04/10 13:27, Ashutosh Bapat wrote:
> On Mon, Apr 9, 2018 at 8:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Apr 6, 2018 at 11:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> David Rowley <david.rowley@2ndquadrant.com> writes:
>>>> Sounds like you're saying that if we have too many alternative files
>>>> then there's a chance that one could pass by luck.
>>>
>>> Yeah, exactly: it passed, but did it pass for the right reason?
>>>
>>> If there's just two expected-files, it's likely not a big problem,
>>> but if you have a bunch it's something to worry about.
>>>
>>> I'm also wondering how come we had hash partitioning before and
>>> did not have this sort of problem.  Is it just that we added a
>>> new test that's more sensitive to the details of the hashing
>>> (if so, could it be made less so)?  Or is there actually more
>>> platform dependence now than before (and if so, why is that)?
>>
>> The existing hash partitioning tests did have some dependencies on the
>> hash function, but they took care not to use the built-in hash
>> functions.  Instead they did stuff like this:
>>
>> CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS
>> $$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE;
>> CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS
>> OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8);
>> CREATE TABLE mchash (a int, b text, c jsonb)
>>   PARTITION BY HASH (a test_int4_ops, b test_text_ops);
>>
>> I think that this approach should also be used for the new tests.
>> Variant expected output files are a pain to maintain, and you
>> basically just have to take whatever output you get as the right
>> answer, because nobody knows what output a certain built-in hash
>> function should produce for a given input except by running the code.
>> If you do the kind of thing shown above, though, then you can easily
>> see by inspection that you're getting the right answer.

Thanks for the idea.  I think it makes sense and also agree that alternate
outputs approach is not perfectly reliable and maintainable.

> +1.

Attached find a patch that rewrites hash partition pruning tests that
away.  It creates two hash operator classes, one for int4 and another for
text type and uses them to create hash partitioned table to be used in the
tests, like done in the existing tests in hash_part.sql.  Since that makes
tests (hopefully) reliably return the same result always, I no longer see
the need to keep them in a separate partition_prune_hash.sql.  The
reasoning behind having the separate file was to keep the alternative
output file small as David explained in [1].

However, I noticed that there is a bug in RelationBuildPartitionKey that
causes a crash when using a SQL function as partition support function as
the revised tests do, so please refer to and apply the patches I posted
here before running the revised tests:

https://www.postgresql.org/message-id/3041e853-b1dd-a0c6-ff21-7cc5633bffd0%40lab.ntt.co.jp

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAKJS1f-SON_hAekqoV4_WQwJBtJ_rvvSe68jRNhuYcXqQ8PoQg%40mail.gmail.com

Вложения

v1-0001-Rewrite-hash-partition-pruning-tests-to-use-custo.patch

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

10 апреля 2018 г., 18:02:55

On 10 April 2018 at 20:56, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/04/10 13:27, Ashutosh Bapat wrote:
>> On Mon, Apr 9, 2018 at 8:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS
>>> $$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE;
>>> CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS
>>> OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8);
>>> CREATE TABLE mchash (a int, b text, c jsonb)
>>>   PARTITION BY HASH (a test_int4_ops, b test_text_ops);
>
> Thanks for the idea.  I think it makes sense and also agree that alternate
> outputs approach is not perfectly reliable and maintainable.
>
>> +1.
>
> Attached find a patch that rewrites hash partition pruning tests that
> away.  It creates two hash operator classes, one for int4 and another for
> text type and uses them to create hash partitioned table to be used in the
> tests, like done in the existing tests in hash_part.sql.  Since that makes
> tests (hopefully) reliably return the same result always, I no longer see
> the need to keep them in a separate partition_prune_hash.sql.  The
> reasoning behind having the separate file was to keep the alternative
> output file small as David explained in [1].
> [1]
> https://www.postgresql.org/message-id/CAKJS1f-SON_hAekqoV4_WQwJBtJ_rvvSe68jRNhuYcXqQ8PoQg%40mail.gmail.com

I had a quick look, but I'm still confused about why a function like
hash_uint32_extended() is susceptible to varying results depending on
CPU endianness but hash_combine64 is not.

Apart from that confusion, looking at the patch:

+CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS
+$$SELECT coalesce($1)::int8$$ LANGUAGE sql IMMUTABLE STRICT;
+CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS
+OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8);
+CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS
+$$SELECT length(coalesce($1))::int8$$ LANGUAGE sql IMMUTABLE STRICT;


Why coalesce here? Maybe I've not thought of something, but coalesce
only seems useful to me if there's > 1 argument. Plus the function is
strict, so not sure it's really doing even if you added a default.

I know this one was there before, but I only just noticed it:

+-- pruning should work if non-null values are provided for all the keys
+explain (costs off) select * from hp where a is null and b is null;

The comment is a bit misleading given the first test below it is
testing for nulls. Maybe it can be changed to

+-- pruning should work if values or is null clauses are provided for
all partition keys.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Ashutosh Bapat

Дата:

10 апреля 2018 г., 18:11:55

On Tue, Apr 10, 2018 at 5:32 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 10 April 2018 at 20:56, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2018/04/10 13:27, Ashutosh Bapat wrote:
>>> On Mon, Apr 9, 2018 at 8:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS
>>>> $$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE;
>>>> CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS
>>>> OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8);
>>>> CREATE TABLE mchash (a int, b text, c jsonb)
>>>>   PARTITION BY HASH (a test_int4_ops, b test_text_ops);
>>
>> Thanks for the idea.  I think it makes sense and also agree that alternate
>> outputs approach is not perfectly reliable and maintainable.
>>
>>> +1.
>>
>> Attached find a patch that rewrites hash partition pruning tests that
>> away.  It creates two hash operator classes, one for int4 and another for
>> text type and uses them to create hash partitioned table to be used in the
>> tests, like done in the existing tests in hash_part.sql.  Since that makes
>> tests (hopefully) reliably return the same result always, I no longer see
>> the need to keep them in a separate partition_prune_hash.sql.  The
>> reasoning behind having the separate file was to keep the alternative
>> output file small as David explained in [1].
>> [1]
>> https://www.postgresql.org/message-id/CAKJS1f-SON_hAekqoV4_WQwJBtJ_rvvSe68jRNhuYcXqQ8PoQg%40mail.gmail.com
>
> I had a quick look, but I'm still confused about why a function like
> hash_uint32_extended() is susceptible to varying results depending on
> CPU endianness but hash_combine64 is not.
>
> Apart from that confusion, looking at the patch:
>
> +CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS
> +$$SELECT coalesce($1)::int8$$ LANGUAGE sql IMMUTABLE STRICT;
> +CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS
> +OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8);
> +CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS
> +$$SELECT length(coalesce($1))::int8$$ LANGUAGE sql IMMUTABLE STRICT;
>
>
> Why coalesce here? Maybe I've not thought of something, but coalesce
> only seems useful to me if there's > 1 argument. Plus the function is
> strict, so not sure it's really doing even if you added a default.

I think Amit Langote wanted to write coalesce($1, $2), $2 being the
seed for hash function. See how hash operator class functions are
defined in sql/insert.sql. May be we should just use the same
functions or even the same tables.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

11 апреля 2018 г., 11:47:53

Thanks for the comment.

On 2018/04/10 21:11, Ashutosh Bapat wrote:
> On Tue, Apr 10, 2018 at 5:32 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> Apart from that confusion, looking at the patch:
>>
>> +CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS
>> +$$SELECT coalesce($1)::int8$$ LANGUAGE sql IMMUTABLE STRICT;
>> +CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS
>> +OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8);
>> +CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS
>> +$$SELECT length(coalesce($1))::int8$$ LANGUAGE sql IMMUTABLE STRICT;
>>
>>
>> Why coalesce here? Maybe I've not thought of something, but coalesce
>> only seems useful to me if there's > 1 argument. Plus the function is
>> strict, so not sure it's really doing even if you added a default.
> 
> I think Amit Langote wanted to write coalesce($1, $2), $2 being the
> seed for hash function. See how hash operator class functions are
> defined in sql/insert.sql.

Actually, I referenced functions and operator classes defined in
hash_part.sql, not insert.sql.  Although as you point out, I didn't think
very hard about the significance of $2 passed to coalesce in those
functions.  I will fix that and add it back, along with some other changes
that makes them almost identical with definitions in hash_part.sql.

> May be we should just use the same
> functions or even the same tables.

Because hash_part.sql and partition_prune.sql tests run in parallel, I've
decided to rename the functions, operator classes, and the tables in
partition_prune.sql.  It seems like a good idea in any case.  Also, since
the existing pruning tests were written with that table, I decided not to
change that.

Will post an updated patch after addressing David's comment.

Regards,
Amit

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

11 апреля 2018 г., 12:04:24

Thanks for the review.

On 2018/04/10 21:02, David Rowley wrote:
> On 10 April 2018 at 20:56, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2018/04/10 13:27, Ashutosh Bapat wrote:
>>> On Mon, Apr 9, 2018 at 8:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS
>>>> $$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE;
>>>> CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS
>>>> OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8);
>>>> CREATE TABLE mchash (a int, b text, c jsonb)
>>>>   PARTITION BY HASH (a test_int4_ops, b test_text_ops);
>>
>> Thanks for the idea.  I think it makes sense and also agree that alternate
>> outputs approach is not perfectly reliable and maintainable.
>>
>>> +1.
>>
>> Attached find a patch that rewrites hash partition pruning tests that
>> away.  It creates two hash operator classes, one for int4 and another for
>> text type and uses them to create hash partitioned table to be used in the
>> tests, like done in the existing tests in hash_part.sql.  Since that makes
>> tests (hopefully) reliably return the same result always, I no longer see
>> the need to keep them in a separate partition_prune_hash.sql.  The
>> reasoning behind having the separate file was to keep the alternative
>> output file small as David explained in [1].
>> [1]
>> https://www.postgresql.org/message-id/CAKJS1f-SON_hAekqoV4_WQwJBtJ_rvvSe68jRNhuYcXqQ8PoQg%40mail.gmail.com
> 
> I had a quick look, but I'm still confused about why a function like
> hash_uint32_extended() is susceptible to varying results depending on
> CPU endianness but hash_combine64 is not.

It might as well be the combination of both that's sensitive to
endianness.  I too am not sure exactly which part.  They're are both used
in succession in compute_hash_value:

            /*
             * Compute hash for each datum value by calling respective
             * datatype-specific hash functions of each partition key
             * attribute.
             */
            hash = FunctionCall2(&partsupfunc[i], values[i], seed);

            /* Form a single 64-bit hash value */
            rowHash = hash_combine64(rowHash, DatumGetUInt64(hash));

> Apart from that confusion, looking at the patch:
> 
> +CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS
> +$$SELECT coalesce($1)::int8$$ LANGUAGE sql IMMUTABLE STRICT;
> +CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS
> +OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8);
> +CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS
> +$$SELECT length(coalesce($1))::int8$$ LANGUAGE sql IMMUTABLE STRICT;
> 
> 
> Why coalesce here? Maybe I've not thought of something, but coalesce
> only seems useful to me if there's > 1 argument. Plus the function is
> strict, so not sure it's really doing even if you added a default.

After reading Ashutosh's comment, I realized I didn't really mean to add
the STRICT to those function definitions.  As these are not operators, but
support (hash) procedures, it's insignificant to the pruning code whether
they are STRICT or not, unlike clause operators where it is.

Also, I've adopted the coalesce-based hashing function from hash_part.sql,
albeit with unnecessary tweaks.  I've not read anywhere about why the
coalesce was used in the first place, but it's insignificant for our
purpose here anyway.

> I know this one was there before, but I only just noticed it:
> 
> +-- pruning should work if non-null values are provided for all the keys
> +explain (costs off) select * from hp where a is null and b is null;
> 
> The comment is a bit misleading given the first test below it is
> testing for nulls. Maybe it can be changed to
> 
> +-- pruning should work if values or is null clauses are provided for
> all partition keys.
I have adjusted the comments.

Updated patch attached.

Thanks,
Amit

Вложения

v2-0001-Rewrite-hash-partition-pruning-tests-to-use-custo.patch

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

11 апреля 2018 г., 14:59:48

On 11 April 2018 at 18:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Updated patch attached.

Thanks for the updated patch.

The only thing I'm not sure about is the chances you've made to the
COALESCE function.

+CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS
+$$SELECT coalesce($1, $2)::int8$$ LANGUAGE sql IMMUTABLE;
+CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS
+OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8);
+CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS
+$$SELECT length(coalesce($1, ''))::int8$$ LANGUAGE sql IMMUTABLE;

Why does one default to the seed and the other to an empty string?
Shouldn't they both do the same thing? If you were to copy the
hash_part.sql you'd just coalesce($1, 0) and coalesce($1, ''), any
special reason not to do that?

Also just wondering if it's worth adding some verification that we've
actually eliminated the correct partitions by backing the tests up
with a call to satisfies_hash_partition.

I've attached a delta patch that applies to your v2 which does this.
Do you think it's worth doing?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

add_satisfies_hash_partition_verification.patch

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

11 апреля 2018 г., 15:22:04

Hi David.

Thanks for the review.

On 2018/04/11 17:59, David Rowley wrote:
> On 11 April 2018 at 18:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Updated patch attached.
> 
> Thanks for the updated patch.
> 
> The only thing I'm not sure about is the chances you've made to the
> COALESCE function.
> 
> +CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS
> +$$SELECT coalesce($1, $2)::int8$$ LANGUAGE sql IMMUTABLE;
> +CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS
> +OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8);
> +CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS
> +$$SELECT length(coalesce($1, ''))::int8$$ LANGUAGE sql IMMUTABLE;
> 
> Why does one default to the seed and the other to an empty string?
> Shouldn't they both do the same thing? If you were to copy the
> hash_part.sql you'd just coalesce($1, 0) and coalesce($1, ''), any
> special reason not to do that?

Oops, so I hadn't actually restored it to the way it is in hash_part.sql.

Also, Ashutosh was talking about the custom hashing function used in
insert.sql, not hash_part.sql, which I based my revision upon.

Fixed it now.

> Also just wondering if it's worth adding some verification that we've
> actually eliminated the correct partitions by backing the tests up
> with a call to satisfies_hash_partition.
> 
> I've attached a delta patch that applies to your v2 which does this.
> Do you think it's worth doing?

We can see check by inspection that individual values are in appropriate
partitions, which is the point of having the inserts and the select just
above the actual pruning related tests.  So, I'm not sure if adding the
satisfies_hash_partition against each pruning tests adds much.

Attached revised patch.

Thanks,
Amit

Вложения

v3-0001-Rewrite-hash-partition-pruning-tests-to-use-custo.patch

Re: [HACKERS] path toward faster partition pruning

От

Ashutosh Bapat

Дата:

11 апреля 2018 г., 15:44:02

On Wed, Apr 11, 2018 at 2:52 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

>>
>> I've attached a delta patch that applies to your v2 which does this.
>> Do you think it's worth doing?
>
> We can see check by inspection that individual values are in appropriate
> partitions, which is the point of having the inserts and the select just
> above the actual pruning related tests.  So, I'm not sure if adding the
> satisfies_hash_partition against each pruning tests adds much.

+1.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

11 апреля 2018 г., 18:04:56

On 11 April 2018 at 21:22, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Also just wondering if it's worth adding some verification that we've
>> actually eliminated the correct partitions by backing the tests up
>> with a call to satisfies_hash_partition.
>>
>> I've attached a delta patch that applies to your v2 which does this.
>> Do you think it's worth doing?
>
> We can see check by inspection that individual values are in appropriate
> partitions, which is the point of having the inserts and the select just
> above the actual pruning related tests.  So, I'm not sure if adding the
> satisfies_hash_partition against each pruning tests adds much.

Right, that's true.

> Attached revised patch.

Thanks. It looks fine to me, with or without the
satisfies_hash_partition tests. I agree that they're probably
overkill, but I see you've added them now.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

11 апреля 2018 г., 18:35:54

Here's an idea.  Why don't we move the function/opclass creation lines
to insert.sql, without the DROPs, and use the same functions/opclasses
in the three tests insert.sql, alter_table.sql, hash_part.sql and
partition_prune.sql, i.e. not recreate what are essentially the same
objects three times?  This also leaves them around for the pg_upgrade
test, which is not a bad thing.

(This would require a few updates to insert.sql because the definitions
there are different, but it shouldn't be a problem coverage-wise.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

12 апреля 2018 г., 08:12:16

On 2018/04/11 21:35, Alvaro Herrera wrote:
> Here's an idea.  Why don't we move the function/opclass creation lines
> to insert.sql, without the DROPs, and use the same functions/opclasses
> in the three tests insert.sql, alter_table.sql, hash_part.sql and
> partition_prune.sql, i.e. not recreate what are essentially the same
> objects three times?  This also leaves them around for the pg_upgrade
> test, which is not a bad thing.
> 
> (This would require a few updates to insert.sql because the definitions
> there are different, but it shouldn't be a problem coverage-wise.)

OK, I've tried doing that.  Needed adjustments to hash_part.sql as well.
The hash function for int4 was defined differently in insert.sql,
alter_table.sql, and hash_part.sql.  I went with the definition in
insert.sql, which although slightly different from the one
alter_table.sql, didn't affect the latter's output in any way.  Since the
definition in hash_part.sql was totally different, a couple of tests
needed adjusting after starting to use hash opclasses defined in insert.sql.

Attached updated patch.

PS: git grep "partition by hash\|PARTITION BY HASH" on src/test indicates
that there are hash partitioning related tests in create_table,
foreign_key, and partition_join files as well.  Do we want to use the
custom opclass in those files as well?

Thanks,
Amit

Вложения

v4-0001-Rewrite-hash-partition-pruning-tests-to-use-custo.patch

Re: [HACKERS] path toward faster partition pruning

От

Robert Haas

Дата:

12 апреля 2018 г., 22:47:13

On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Here's an idea.  Why don't we move the function/opclass creation lines
> to insert.sql, without the DROPs, and use the same functions/opclasses
> in the three tests insert.sql, alter_table.sql, hash_part.sql and
> partition_prune.sql, i.e. not recreate what are essentially the same
> objects three times?  This also leaves them around for the pg_upgrade
> test, which is not a bad thing.

That sounds good, but maybe we should go further and move the
partitioning tests out of generically-named things like insert.sql
altogether and have test names that actually mention partitioning.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

13 апреля 2018 г., 08:15:14

On 2018/04/13 1:47, Robert Haas wrote:
> On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> Here's an idea.  Why don't we move the function/opclass creation lines
>> to insert.sql, without the DROPs, and use the same functions/opclasses
>> in the three tests insert.sql, alter_table.sql, hash_part.sql and
>> partition_prune.sql, i.e. not recreate what are essentially the same
>> objects three times?  This also leaves them around for the pg_upgrade
>> test, which is not a bad thing.
> 
> That sounds good, but maybe we should go further and move the
> partitioning tests out of generically-named things like insert.sql
> altogether and have test names that actually mention partitioning.

Do you mean to do that for *all* files that have tests exercising some
partitioning code, even if it's just one test?  I can see that tests in at
least some of them could be put into their own partition_ file as follows:

partition_insert (including tests in insert_conflict)
partition_update
partition_triggers
partition_indexing (indexing.sql added when partitioned indexes went in)
partition_ddl (for the tests in create_table and alter_table)

That leaves:

cluster
create_index (one test here could be moved to partition_indexing?)
foreign_data (could be moved to partition_ddl?)
foreign_key  (could be moved to partition_ddl?)
hash_part    (leave alone, because already contains 'part' in the name?)
identity
join
plancache
plpgsql
publication
rowsecurity
rules
stats_ext
tablesample
truncate
updatable_views
vacuum


What about the tests in inherit.sql that start with:

--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
--

Maybe, just move all of them to partition_prune.sql, because we no longer
use constraint exclusion for pruning?

Thanks,
Amit

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

13 апреля 2018 г., 09:03:18

On 13 April 2018 at 14:15, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/04/13 1:47, Robert Haas wrote:
>> On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>>> Here's an idea.  Why don't we move the function/opclass creation lines
>>> to insert.sql, without the DROPs, and use the same functions/opclasses
>>> in the three tests insert.sql, alter_table.sql, hash_part.sql and
>>> partition_prune.sql, i.e. not recreate what are essentially the same
>>> objects three times?  This also leaves them around for the pg_upgrade
>>> test, which is not a bad thing.
>>
>> That sounds good, but maybe we should go further and move the
>> partitioning tests out of generically-named things like insert.sql
>> altogether and have test names that actually mention partitioning.
>
> Do you mean to do that for *all* files that have tests exercising some
> partitioning code, even if it's just one test?  I can see that tests in at
> least some of them could be put into their own partition_ file as follows:

Wouldn't it be best to just move hash partition tests into hash_part?
Leave all the other stuff where it is?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Ashutosh Bapat

Дата:

13 апреля 2018 г., 09:35:15

On Fri, Apr 13, 2018 at 7:45 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/04/13 1:47, Robert Haas wrote:
>> On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>>> Here's an idea.  Why don't we move the function/opclass creation lines
>>> to insert.sql, without the DROPs, and use the same functions/opclasses
>>> in the three tests insert.sql, alter_table.sql, hash_part.sql and
>>> partition_prune.sql, i.e. not recreate what are essentially the same
>>> objects three times?  This also leaves them around for the pg_upgrade
>>> test, which is not a bad thing.
>>
>> That sounds good, but maybe we should go further and move the
>> partitioning tests out of generically-named things like insert.sql
>> altogether and have test names that actually mention partitioning.
>
> Do you mean to do that for *all* files that have tests exercising some
> partitioning code, even if it's just one test?  I can see that tests in at
> least some of them could be put into their own partition_ file as follows:
>
> partition_insert (including tests in insert_conflict)
> partition_update
> partition_triggers
> partition_indexing (indexing.sql added when partitioned indexes went in)
> partition_ddl (for the tests in create_table and alter_table)
>

We haven't generally created test files specific to a table type for
example temporary tables or unlogged tables, instead have created
files by SQL commands. But then that's not true for indexes; we have
separate files for indexes and we also have separate file for
materialized views and also for various data types. So, our test file
organization seems to have cut across of SQL commands and object
types. But partitioning seems an area large enough to have files of
its own; we already have partition_join and partition_aggregate.

Do we want to move to a directory based organization for tests also,
where sql/ expected/ will have directories within them for various
types of objects like partitioned tables, indexes, regular tables,
datatypes etc. and each of those will have files organized by sql
commands? An immediate question arises as to where to add the files
which exercises a mixture of objects; may be in a directory
corresponding to the primary object like materialized views over
partitioned tables, would fit materialized view (or just views?)
directory.

Whatever organization we want to use, it should be easy to find
testcases for relevant functionality e.g. all tests for partitioned
tables or all alter table command tests.

> What about the tests in inherit.sql that start with:
>
> --
> -- Check that constraint exclusion works correctly with partitions using
> -- implicit constraints generated from the partition bound information.
> --
>
> Maybe, just move all of them to partition_prune.sql, because we no longer
> use constraint exclusion for pruning?

I think we need to have some testcases somwhere to test constraint
exclusion on partitions and partitioned tables, those do not
necessarily fit partition pruning.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

13 апреля 2018 г., 20:50:44

Robert Haas wrote:
> On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Here's an idea.  Why don't we move the function/opclass creation lines
> > to insert.sql, without the DROPs, and use the same functions/opclasses
> > in the three tests insert.sql, alter_table.sql, hash_part.sql and
> > partition_prune.sql, i.e. not recreate what are essentially the same
> > objects three times?  This also leaves them around for the pg_upgrade
> > test, which is not a bad thing.
> 
> That sounds good, but maybe we should go further and move the
> partitioning tests out of generically-named things like insert.sql
> altogether and have test names that actually mention partitioning.

I don't think that's necessary to fix the problem that
partition_prune_hash.sql file has two expected output files.  If you
want to propose such a reorganization, feel free, but let's not hijack
the patch at hand.  For the record, I'm not a fan of the idea.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Robert Haas

Дата:

13 апреля 2018 г., 23:12:28

On Fri, Apr 13, 2018 at 10:50 AM, Alvaro Herrera
<alvherre@alvh.no-ip.org> wrote:
> Robert Haas wrote:
>> On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> > Here's an idea.  Why don't we move the function/opclass creation lines
>> > to insert.sql, without the DROPs, and use the same functions/opclasses
>> > in the three tests insert.sql, alter_table.sql, hash_part.sql and
>> > partition_prune.sql, i.e. not recreate what are essentially the same
>> > objects three times?  This also leaves them around for the pg_upgrade
>> > test, which is not a bad thing.
>>
>> That sounds good, but maybe we should go further and move the
>> partitioning tests out of generically-named things like insert.sql
>> altogether and have test names that actually mention partitioning.
>
> I don't think that's necessary to fix the problem that
> partition_prune_hash.sql file has two expected output files.  If you
> want to propose such a reorganization, feel free, but let's not hijack
> the patch at hand.  For the record, I'm not a fan of the idea.

Fair enough.  I don't think I'm hacking the thread much more than it
was already hijacked; and it was just a thought.  I haven't really
studied the tests well enough to have a really clear idea what a
better organization would look like.  It was just that, for example,
the commit that added hash partitioning added tests to 5 different
files, and some things had to be duplicated as a result.  It sounds
like what you've already done improves that, but I was wondering if
there's a way to do better.  I don't feel super-strongly about it
though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

20 апреля 2018 г., 21:58:50

Amit Langote wrote:

> PS: git grep "partition by hash\|PARTITION BY HASH" on src/test indicates
> that there are hash partitioning related tests in create_table,
> foreign_key, and partition_join files as well.  Do we want to use the
> custom opclass in those files as well?

By the way, let me suggest 'git grep -i' instead.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

23 апреля 2018 г., 07:04:11

On 2018/04/21 0:58, Alvaro Herrera wrote:
> Amit Langote wrote:
> 
>> PS: git grep "partition by hash\|PARTITION BY HASH" on src/test indicates
>> that there are hash partitioning related tests in create_table,
>> foreign_key, and partition_join files as well.  Do we want to use the
>> custom opclass in those files as well?
> 
> By the way, let me suggest 'git grep -i' instead.

Ah, thanks.

Regards,
Amit

Re: [HACKERS] path toward faster partition pruning

От

Marina Polyakova

Дата:

04 мая 2018 г., 15:32:23

Hello everyone in this thread!

I got a similar server crash as in [1] on the master branch since the 
commit 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails 
because the second argument ScalarArrayOpExpr is not a Const or an 
ArrayExpr, but is an ArrayCoerceExpr (see [2]):

=# create table list_parted (
   a varchar
) partition by list (a);
=# create table part_ab_cd partition of list_parted for values in ('ab', 
'cd');
=# CREATE OR REPLACE FUNCTION public.x_stl_text_integer (
)
RETURNS text STABLE AS
$body$
BEGIN
RAISE NOTICE 's text integer';
RETURN 1::text;
END;
$body$
LANGUAGE 'plpgsql';
=# explain (costs off) select * from list_parted where a in ('ab', 'cd', 
x_stl_text_integer());
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

[1] 
https://www.postgresql.org/message-id/CAKcux6nCsCmu9oUnnuKZkeBenYvUFbU2Lt4q2MFNEb7QErzn8w%40mail.gmail.com

[2] partprune.c, function match_clause_to_partition_key:
        if (IsA(rightop, Const))
        {
            ...
        }
        else
        {
            ArrayExpr  *arrexpr = castNode(ArrayExpr, rightop); # fails here
            ...
        }

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] path toward faster partition pruning

От

Michael Paquier

Дата:

07 мая 2018 г., 07:37:10

On Fri, May 04, 2018 at 12:32:23PM +0300, Marina Polyakova wrote:
> I got a similar server crash as in [1] on the master branch since the commit
> 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails because
> the second argument ScalarArrayOpExpr is not a Const or an ArrayExpr, but is
> an ArrayCoerceExpr (see [2]):

Indeed, I can see the crash.  I have been playing with this stuff and I
am in the middle of writing the patch, but let's track this properly for
now.
--
Michael

Вложения

signature.asc

Re: [HACKERS] path toward faster partition pruning

От

Marina Polyakova

Дата:

07 мая 2018 г., 13:30:12

On 07-05-2018 4:37, Michael Paquier wrote:
> On Fri, May 04, 2018 at 12:32:23PM +0300, Marina Polyakova wrote:
>> I got a similar server crash as in [1] on the master branch since the 
>> commit
>> 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails 
>> because
>> the second argument ScalarArrayOpExpr is not a Const or an ArrayExpr, 
>> but is
>> an ArrayCoerceExpr (see [2]):
> 
> Indeed, I can see the crash.  I have been playing with this stuff and I
> am in the middle of writing the patch, but let's track this properly 
> for
> now.

Thank you very much!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] path toward faster partition pruning

От

Michael Paquier

Дата:

07 мая 2018 г., 13:56:22

On Mon, May 07, 2018 at 10:37:10AM +0900, Michael Paquier wrote:
> On Fri, May 04, 2018 at 12:32:23PM +0300, Marina Polyakova wrote:
> > I got a similar server crash as in [1] on the master branch since the commit
> > 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails because
> > the second argument ScalarArrayOpExpr is not a Const or an ArrayExpr, but is
> > an ArrayCoerceExpr (see [2]):
>
> Indeed, I can see the crash.  I have been playing with this stuff and I
> am in the middle of writing the patch, but let's track this properly for
> now.

So the problem appears when an expression needs to use
COERCION_PATH_ARRAYCOERCE for a type coercion from one type to another,
which requires an execution state to be able to build the list of
elements.  The clause matching happens at planning state, so while there
are surely cases where this could be improved depending on the
expression type, I propose to just discard all clauses which do not
match OpExpr and ArrayExpr for now, as per the attached.  It would be
definitely a good practice to have a default code path returning
PARTCLAUSE_UNSUPPORTED where the element list cannot be built.

Thoughts?
--
Michael

Вложения

Re: [Suspect SPAM] Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

08 мая 2018 г., 13:07:41

Thank you Marina for the report and Michael for following up.

On 2018/05/07 16:56, Michael Paquier wrote:
> On Mon, May 07, 2018 at 10:37:10AM +0900, Michael Paquier wrote:
>> On Fri, May 04, 2018 at 12:32:23PM +0300, Marina Polyakova wrote:
>>> I got a similar server crash as in [1] on the master branch since the commit
>>> 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails because
>>> the second argument ScalarArrayOpExpr is not a Const or an ArrayExpr, but is
>>> an ArrayCoerceExpr (see [2]):
>>
>> Indeed, I can see the crash.  I have been playing with this stuff and I
>> am in the middle of writing the patch, but let's track this properly for
>> now.
> 
> So the problem appears when an expression needs to use
> COERCION_PATH_ARRAYCOERCE for a type coercion from one type to another,
> which requires an execution state to be able to build the list of
> elements.  The clause matching happens at planning state, so while there
> are surely cases where this could be improved depending on the
> expression type, I propose to just discard all clauses which do not
> match OpExpr and ArrayExpr for now, as per the attached.  It would be
> definitely a good practice to have a default code path returning
> PARTCLAUSE_UNSUPPORTED where the element list cannot be built.
> 
> Thoughts?

I have to agree to go with this conservative approach for now.  Although
we might be able to evaluate the array elements by applying the coercion
specified by ArrayCoerceExpr, let's save that as an improvement to be
pursued later.

FWIW, constraint exclusion wouldn't prune in this case either (that is, if
you try this example with PG 10 or using HEAD as of the parent of
9fdb675fc5), but it doesn't crash like the new pruning code does.

Thanks again.

Regards,
Amit

Re: [Suspect SPAM] Re: [HACKERS] path toward faster partition pruning

От

Michael Paquier

Дата:

08 мая 2018 г., 15:55:15

On Tue, May 08, 2018 at 04:07:41PM +0900, Amit Langote wrote:
> I have to agree to go with this conservative approach for now.  Although
> we might be able to evaluate the array elements by applying the coercion
> specified by ArrayCoerceExpr, let's save that as an improvement to be
> pursued later.

Thanks for confirming.  Yes, non-volatile functions would be actually
safe, and we'd need to be careful about NULL handling as well, but
that's definitely out of scope for v11.

> FWIW, constraint exclusion wouldn't prune in this case either (that is, if
> you try this example with PG 10 or using HEAD as of the parent of
> 9fdb675fc5), but it doesn't crash like the new pruning code does.

Yeah, I have noticed that.
--
Michael

Вложения

signature.asc

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

08 мая 2018 г., 22:27:52

Michael Paquier wrote:

> So the problem appears when an expression needs to use
> COERCION_PATH_ARRAYCOERCE for a type coercion from one type to another,
> which requires an execution state to be able to build the list of
> elements.  The clause matching happens at planning state, so while there
> are surely cases where this could be improved depending on the
> expression type, I propose to just discard all clauses which do not
> match OpExpr and ArrayExpr for now, as per the attached.  It would be
> definitely a good practice to have a default code path returning
> PARTCLAUSE_UNSUPPORTED where the element list cannot be built.
> 
> Thoughts?

I found a related crash and I'm investigating it further.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Alvaro Herrera

Дата:

09 мая 2018 г., 04:05:46

So I found that this query also crashed (using your rig),

create table coercepart (a varchar) partition by list (a);
create table coercepart_ab partition of coercepart for values in ('ab');
create table coercepart_bc partition of coercepart for values in ('bc');
create table coercepart_cd partition of coercepart for values in ('cd');
explain (costs off) select * from coercepart where a ~ any ('{ab}');

The reason for this crash is that gen_partprune_steps_internal() is
unable to generate any steps for the clause -- which is natural, since
the operator is not in a btree opclass.  There are various callers
of gen_partprune_steps_internal that are aware that it could return an
empty set of steps, but this one in match_clause_to_partition_key for
the ScalarArrayOpExpr case was being a bit too optimistic.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

partprune-coerce-array-2.patch

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

09 мая 2018 г., 08:06:45

Hi.

On 2018/05/09 7:05, Alvaro Herrera wrote:
> So I found that this query also crashed (using your rig),
> 
> create table coercepart (a varchar) partition by list (a);
> create table coercepart_ab partition of coercepart for values in ('ab');
> create table coercepart_bc partition of coercepart for values in ('bc');
> create table coercepart_cd partition of coercepart for values in ('cd');
> explain (costs off) select * from coercepart where a ~ any ('{ab}');
> 
> The reason for this crash is that gen_partprune_steps_internal() is
> unable to generate any steps for the clause -- which is natural, since
> the operator is not in a btree opclass.  There are various callers
> of gen_partprune_steps_internal that are aware that it could return an
> empty set of steps, but this one in match_clause_to_partition_key for
> the ScalarArrayOpExpr case was being a bit too optimistic.

Yeah, good catch!  That fixes the crash, but looking around that code a
bit, it seems that we shouldn't even have gotten up to the point you're
proposing to fix.  If saop_op is not in the partitioning opfamily, it
should have bailed out much sooner, that is, here:

    /*
     * In case of NOT IN (..), we get a '<>', which we handle if list
     * partitioning is in use and we're able to confirm that it's negator
     * is a btree equality operator belonging to the partitioning operator
     * family.
     */
    if (!op_in_opfamily(saop_op, partopfamily))
    {
        <snip>

        negator = get_negator(saop_op);
        if (OidIsValid(negator) && op_in_opfamily(negator, partopfamily))
        {
            <snip>
        }
+       else
+           /* otherwise, unsupported! */
+           return PARTCLAUSE_UNSUPPORTED;

Let me propose that we also have this along with the rest of the changes
you made in that part of the function.  So, attached is an updated patch.

Thanks,
Amit

Вложения

partprune-coerce-array-3.patch

Re: [HACKERS] path toward faster partition pruning

От

Michael Paquier

Дата:

09 мая 2018 г., 08:20:18

On Tue, May 08, 2018 at 07:05:46PM -0300, Alvaro Herrera wrote:
> The reason for this crash is that gen_partprune_steps_internal() is
> unable to generate any steps for the clause -- which is natural, since
> the operator is not in a btree opclass.  There are various callers
> of gen_partprune_steps_internal that are aware that it could return an
> empty set of steps, but this one in match_clause_to_partition_key for
> the ScalarArrayOpExpr case was being a bit too optimistic.

Indeed.

While looking at this code, is there any reason to not make
gen_partprune_steps static?  This is only used in partprune.c for now,
so the intention is to make it available for future patches?
--
Michael

Вложения

signature.asc

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

09 мая 2018 г., 08:29:32

On 2018/05/09 11:20, Michael Paquier wrote:
> While looking at this code, is there any reason to not make
> gen_partprune_steps static?  This is only used in partprune.c for now,
> so the intention is to make it available for future patches?

Yeah, making it static might be a good idea.  I had made it externally
visible, because I was under the impression that the runtime pruning
related code would want to call it from elsewhere within the planner.
But, instead it introduced a make_partition_pruneinfo() which in turn
calls get_partprune_steps.

Thanks,
Amit

Re: [HACKERS] path toward faster partition pruning

От

David Rowley

Дата:

09 мая 2018 г., 08:31:39

On 9 May 2018 at 14:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/05/09 11:20, Michael Paquier wrote:
>> While looking at this code, is there any reason to not make
>> gen_partprune_steps static?  This is only used in partprune.c for now,
>> so the intention is to make it available for future patches?
>
> Yeah, making it static might be a good idea.  I had made it externally
> visible, because I was under the impression that the runtime pruning
> related code would want to call it from elsewhere within the planner.
> But, instead it introduced a make_partition_pruneinfo() which in turn
> calls get_partprune_steps.

Yeah. Likely left over from when run-time pruning was generating the
steps during execution rather than during planning.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

09 мая 2018 г., 11:01:26

On 2018/05/09 11:31, David Rowley wrote:
> On 9 May 2018 at 14:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2018/05/09 11:20, Michael Paquier wrote:
>>> While looking at this code, is there any reason to not make
>>> gen_partprune_steps static?  This is only used in partprune.c for now,
>>> so the intention is to make it available for future patches?
>>
>> Yeah, making it static might be a good idea.  I had made it externally
>> visible, because I was under the impression that the runtime pruning
>> related code would want to call it from elsewhere within the planner.
>> But, instead it introduced a make_partition_pruneinfo() which in turn
>> calls get_partprune_steps.
> 
> Yeah. Likely left over from when run-time pruning was generating the
> steps during execution rather than during planning.

Here is a patch that does that.

Thanks,
Amit

Вложения

make-gen_partprune_steps-static.patch

Re: [HACKERS] path toward faster partition pruning

От

Michael Paquier

Дата:

09 мая 2018 г., 12:10:39

On Wed, May 09, 2018 at 02:01:26PM +0900, Amit Langote wrote:
> On 2018/05/09 11:31, David Rowley wrote:
>> On 9 May 2018 at 14:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> On 2018/05/09 11:20, Michael Paquier wrote:
>>>> While looking at this code, is there any reason to not make
>>>> gen_partprune_steps static?  This is only used in partprune.c for now,
>>>> so the intention is to make it available for future patches?
>>>
>>> Yeah, making it static might be a good idea.  I had made it externally
>>> visible, because I was under the impression that the runtime pruning
>>> related code would want to call it from elsewhere within the planner.
>>> But, instead it introduced a make_partition_pruneinfo() which in turn
>>> calls get_partprune_steps.
>>
>> Yeah. Likely left over from when run-time pruning was generating the
>> steps during execution rather than during planning.
>
> Here is a patch that does that.

Thanks, Amit.

Alvaro, could it be possible to consider as well the patch I posted
here?
https://www.postgresql.org/message-id/20180424012042.GD1570@paquier.xyz

This removes a useless default clause in partprune.c and it got
forgotten in the crowd.  Just attaching it again here, and it can just
be applied on top of the rest.
--
Michael

On Wed, May 09, 2018 at 10:39:07AM -0300, Alvaro Herrera wrote:
> Michael Paquier wrote:
>> This removes a useless default clause in partprune.c and it got
>> forgotten in the crowd.  Just attaching it again here, and it can just
>> be applied on top of the rest.
>
> Done, thanks for insisting.

Thanks!
--
Michael

Вложения

signature.asc

Re: [HACKERS] path toward faster partition pruning

От

Amit Langote

Дата:

10 мая 2018 г., 06:48:50

On 2018/05/09 22:43, Alvaro Herrera wrote:
> Amit Langote wrote:
>> On 2018/05/09 11:31, David Rowley wrote:
>>> On 9 May 2018 at 14:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> On 2018/05/09 11:20, Michael Paquier wrote:
>>>>> While looking at this code, is there any reason to not make
>>>>> gen_partprune_steps static?  This is only used in partprune.c for now,
>>>>> so the intention is to make it available for future patches?
> 
>> Here is a patch that does that.
> 
> Pushed, thanks.

Thank you.

Regards,
Amit

Re: [HACKERS] path toward faster partition pruning

От

Marina Polyakova

Дата:

10 мая 2018 г., 15:36:22

On 09-05-2018 17:30, Alvaro Herrera wrote:
> Marina Polyakova wrote:
>> Hello everyone in this thread!
> 
>> I got a similar server crash as in [1] on the master branch since the 
>> commit
>> 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails 
>> because
>> the second argument ScalarArrayOpExpr is not a Const or an ArrayExpr, 
>> but is
>> an ArrayCoerceExpr (see [2]):
> 
> Hello Marina, thanks for reporting this.  I have pushed all fixes
> derived from this report -- thanks to Amit and Michaël for those.
> I verified your test case no longer crashes.  If you have more 
> elaborate
> test cases, please do try these too.

Hello, thank you all very much! :)

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: [HACKERS] path toward faster partition pruning

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения