Re: Partition-wise join for join between (declaratively) partitioned tables

Поиск
Список
Период
Сортировка
От Ashutosh Bapat
Тема Re: Partition-wise join for join between (declaratively) partitioned tables
Дата
Msg-id CAFjFpRfdMsGVmmQ797NhodaovnJ==WiKrj=CG0AKDP0HytdM8g@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Partition-wise join for join between (declaratively) partitioned tables  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: Partition-wise join for join between (declaratively) partitioned tables  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
Список pgsql-hackers


On Fri, Jul 8, 2016 at 12:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I haven't reviewed this code yet due to being busy with 9.6, but I
think this is a very important query planner improvement with the
potential for big wins on queries involving large amounts of data.

Suppose we have a pair of equi-partitioned tables.  Right now, if we
choose to perform a hash join, we'll have to build a giant hash table
with all of the rows from every inner partition and then probe it for
every row in every outer partition.  If there are few enough inner
rows that the resultant hash table still fits in work_mem, this is
somewhat inefficient but not terrible - but if it causes us to have to
batch the hash join where we otherwise would not need to do so, then
it really sucks.  Similarly, if we decide to merge-join each pair of
partitions, a partitionwise join may be able to use an internal sort
on some or all partitions whereas if we had to deal with all of the
data at the same time we'd need an external sort, possibly multi-pass.

Or we might be able to use indexes directly without need of a MergeAppend.
 
  And if we choose a nested loop, say over an inner index-scan, we do
O(outer rows) index probes with this optimization but O(outer rows *
inner partitions) index probes without it.

In addition, parallel query can benefit significantly from this kind
of optimization.  Tom recently raised the case of an appendrel where
every child has a parallel-safe path but not every child has a partial
path; currently, we can't go parallel in that case, but it's easy to
see that we could handle it by scheduling the appendrel's children
across a pool of workers.  If we had this optimization, that sort of
thing would be much more likely to be useful, because it could create
appendrels where each member is an N-way join between equipartitioned
tables.  That's particularly important right now because of the
restriction that a partial path must be driven by a Parallel SeqScan,
but even after that restriction is lifted it's easy to imagine that
the effective degree of parallelism for a single index scan may be
limited - so this kind of thing may significantly increase the number
of workers that a given query can use productively.

+1.

The attached patch implements the logic to assess whether two partitioned
tables can be joined using partition-wise join technique described in my last
mail on this thread.

Two partitioned relations are considered for partition-wise join if following
conditions are met (See build_joinrel_part_info() for details):
1. Both the partitions have same number of partitions, with same number of
partition keys and partitioned by same strategy - range or list.
2. They have matching datatypes for partition keys (partkey_types_match())
3. For list partitioned relations, they have same lists for each pair of
partitions, paired by position in which they appear.
4. For range partitioned relations, they have same bounds for each pair of
partitions, paired by their position when ordered in ascending fashion on the
upper bounds.
5. There exists an equi-join condition for each pair of partition keys, paired
by the position in which they appear.

Partition-wise join technique can be applied under more lenient constraints [1]
e.g. joins between tables with different number of partitions but having same
bounds/lists for the common partitions. I am planning to defer that to a later
version of this feature.

A join executed using partition-wise join technique is itself a relation
partitioned by the similar partitioning scheme as the joining relations with
the partition keys combined from the joining relations.

A PartitionOptInfo (uses name similar to RelOptInfo or IndexOptInfo) structure
is used to store the partitioning information for a given base or relation.
In build_simple_rel(), we construct PartitionOptInfo structure for the given
base relation by copying the relation's PartitionDesc and PartitionKey
(structures from Amit Langote's patch). While doing so, all the partition keys
are stored as expressions. The structure also holds the RelOptInfos of the
partition relations. For a join relation, most of the PartitionOptInfo is
copied from either of the joining relations, except the partition keys and
RelOptInfo of partition relations. Partition keys of the join relations are
created by combing partition keys from both the joining relations. The logic to
cosnstruct RelOptInfo for the partition-wise join relations is yet to be
implemented.

Since the logic to create the paths and RelOptInfos for partition-wise join
relations is not implemented yet, a query which can use partition-wise join
fails with error
"ERROR: the relation was considered for partition-wise join, which is not
supported right now.". It will also print messages to show which of the joins
can and can not use partition-wise join technique e.g.
"NOTICE:  join between relations (b 1) and (b 2) is considered for
partition-wise join." The relations are indicated by their relid in the query.
OR
"NOTICE:  join between relations (b 1) and (b 2) is NOT considered for
partition-wise join.".
These messages are for debugging only, and will be removed once path creation
logic is implemented.

The patch adds a test partition_join.sql, which has a number of positive and
negative testcases for joins between partitioned tables.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Magnus Hagander
Дата:
Сообщение: Re: sslmode=require fallback
Следующее
От: Ashutosh Bapat
Дата:
Сообщение: Re: Partition-wise join for join between (declaratively) partitioned tables