Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables

Поиск

Список

Период

Сортировка

От	Ashutosh Bapat
Тема	Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
Дата	13 сентября 2017 г. 22:56:17
Msg-id	CAFjFpRdHb_ZnoDTuBXqrudWXh3H1ibLkr6nHsCFT96fSK4DXtA@mail.gmail.com обсуждение исходный текст
Ответ на	Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
Ответы	Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
Список	pgsql-hackers

Дерево обсуждения

On Wed, Sep 13, 2017 at 12:51 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Sep 12, 2017 at 3:46 AM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> In this case, AcquireExecutorLocks will lock all the relations in
>>> PlannedStmt.rtable, which must include all partitioned tables of all
>>> partition trees involved in the query.  Of those, it will lock the tables
>>> whose RT indexes appear in PlannedStmt.nonleafResultRelations with
>>> RowExclusiveLock mode.  PlannedStmt.nonleafResultRelations is a global
>>> list of all partitioned table RT indexes obtained by concatenating
>>> partitioned_rels lists of all ModifyTable nodes involved in the query
>>> (set_plan_refs does that).  We need to distinguish nonleafResultRelations,
>>> because we need to take the stronger lock on a given table before any
>>> weaker one if it happens to appear in the query as a non-result relation
>>> too, to avoid lock strength upgrade deadlock hazard.
>>
>> Hmm.  The problem with this theory in my view is that it doesn't
>> explain why InitPlan() and ExecOpenScanRelation() lock the relations
>> instead of just assuming that they are already locked either by
>> AcquireExecutorLocks or by planning.  If ExecLockNonLeafAppendTables()
>> doesn't really need to take locks, then ExecOpenScanRelation() must
>> not need to do it either.  We invented ExecLockNonLeafAppendTables()
>> on the occasion of removing the scans of those tables which would
>> previously have caused ExecOpenScanRelation() to be invoked, so as to
>> keep the locking behavior unchanged.
>>
>> AcquireExecutorLocks() looks like an odd bit of code to me.  The
>> executor itself locks result tables in InitPlan() and then everything
>> else during InitPlan() and all of the others later on while walking
>> the plan tree -- comments in InitPlan() say that this is to avoid a
>> lock upgrade hazard if a result rel is also a source rel.  But
>> AcquireExecutorLocks() has no such provision; it just locks everything
>> in RTE order.  In theory, that's a deadlock hazard of another kind, as
>> we just talked about in the context of EIBO.  In fact, expanding in
>> bound order has made the situation worse: before, expansion order and
>> locking order were the same, so maybe having AcquireExecutorLocks()
>> work in RTE order coincidentally happened to give the same result as
>> the executor code itself as long as there are no result relations.
>> But this is certainly not true any more.  I'm not sure it's worth
>> expending a lot of time on this -- it's evidently not a problem in
>> practice, or somebody probably would've complained before now.
>>
>> But that having been said, I don't think we should assume that all the
>> locks taken from the executor are worthless because plancache.c will
>> always do the job for us.  I don't know of a case where we execute a
>> saved plan without going through the plan cache, but that doesn't mean
>> that there isn't one or that there couldn't be one in the future.
>> It's not the job of these partitioning patches to whack around the way
>> we do locking in general -- they should preserve the existing behavior
>> as much as possible.  If we want to get rid of the locking in the
>> executor altogether, that's a separate discussion where, I have a
>> feeling, there will prove to be better reasons for the way things are
>> than we are right now supposing.
>>
>
> I agree that it's not the job of these patches to change the locking
> or even get rid of partitioned_rels. In order to continue returning
> partitioned_rels in Append paths esp. in the case of queries involving
> set operations and partitioned table e.g "select 1 from t1 union all
> select 2 from t1;" in which t1 is multi-level partitioned table, we
> need a fix in add_paths_to_append_rels(). The fix provided in [1] is
> correct but we will need a longer explanation of why we have to
> involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation
> is complicated. If we get rid of partitioned_rels, we don't need to
> fix that code in add_paths_to_append_rel().
>
> I suggested that [2]
> -- (excerpt from [2])
>
> Actually, the original problem that caused this discussion started
> with an assertion failure in get_partitioned_child_rels() as
> Assert(list_length(result) >= 1);
>
> This assertion fails if result is NIL when an intermediate partitioned
> table is passed. May be we should assert (result == NIL ||
> list_length(result) == 1) and allow that function to be called even
> for intermediate partitioned partitions for which the function will
> return NIL. That will leave the code in add_paths_to_append_rel()
> simple. Thoughts?
> --
>
> Amit Langote agrees with this. It kind of makes the assertion lame but
> keeps the code sane. What do you think?

I debugged what happens in case of query "select 1 from t1 union all
select 2 from t1;" with the current HEAD (without multi-level
expansion patch attached). It doesn't set partitioned_rels in Append
path that gets converted into Append plan. Remember t1 is a
multi-level partitioned table here with t1p1 as its immediate
partition and t1p1p1 as partition of t1p1. So, the
set_append_rel_pathlist() recurses once as shown in the following
stack trace.

#0  add_paths_to_append_rel (root=0x23e4308, rel=0x23fb768,
live_childrels=0x23ff5f0) at allpaths.c:1281
#1  0x000000000076e170 in set_append_rel_pathlist (root=0x23e4308,
rel=0x23fb768, rti=4, rte=0x23f3268) at allpaths.c:1262
#2  0x000000000076cf23 in set_rel_pathlist (root=0x23e4308,
rel=0x23fb768, rti=4, rte=0x23f3268) at allpaths.c:431
#3  0x000000000076e0f6 in set_append_rel_pathlist (root=0x23e4308,
rel=0x23fb478, rti=1, rte=0x2382070) at allpaths.c:1247
#4  0x000000000076cf23 in set_rel_pathlist (root=0x23e4308,
rel=0x23fb478, rti=1, rte=0x2382070) at allpaths.c:431
#5  0x000000000076cc22 in set_base_rel_pathlists (root=0x23e4308) at
allpaths.c:309

When add_paths_to_append_rel() (frame 0) is called for t1, it gets
partitioned_rels and stuffs it in append path/s it creates. But those
paths are flattened into the append paths created for the set
operations when add_paths_to_append_rels() is called from frame 3.
While flattening the append paths in accumulate_append_subpath() we do
not pull any partitioned_rels that are stuffed in those paths and thus
the final append path/s created does not have partitioned_rels in
there.

The same behaviour is retained by my v30 patchset [1]. I think we
should go ahead by fixing add_paths_to_append_rel() as done in that
patchset. partitioned_rels needs to be removed from append paths
anyway, so that code will be removed when we do that.

[1] https://www.postgresql.org/message-id/CAFjFpRfHkJW3G=_PnSUc6PbXJE48AWYwyRzaGqtfKzzoU4wXXw@mail.gmail.com
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Robert Haas
Дата: 13 сентября 2017 г., 22:42:14
Сообщение: Re: [HACKERS] expanding inheritance in partition bound order

Следующее

От: Pierre Ducroquet
Дата: 13 сентября 2017 г., 23:17:15
Сообщение: Re: [HACKERS] Bug with pg_basebackup and 'shared' tablespace

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables

Предыдущее

Следующее