Re: [HACKERS] Push down more full joins in postgres_fdw

Поиск

Список

Период

Сортировка

От	Etsuro Fujita
Тема	Re: [HACKERS] Push down more full joins in postgres_fdw
Дата	31 января 2017 г. 11:50:51
Msg-id	5c221a5d-06b3-fa0e-4ed5-0dc59ec6a38c@lab.ntt.co.jp обсуждение исходный текст
Ответ на	Re: [HACKERS] Push down more full joins in postgres_fdw (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
Список	pgsql-hackers

Дерево обсуждения

On 2017/01/30 21:05, Ashutosh Bapat wrote:
> On Mon, Jan 30, 2017 at 5:00 PM, Etsuro Fujita
> <fujita.etsuro@lab.ntt.co.jp> wrote:
>> On 2017/01/27 21:25, Etsuro Fujita wrote:
>>> Sorry, I started thinking we went in the wrong direction.  I added to
>>> deparseSelectStmtForRel build_subquery_tlists, which creates a tlist for
>>> each subquery present in a given join tree prior to deparsing a whole
>>> remote query.  But that's nothing but an overhead; we need to create a
>>> tlist for the top-level query because we use it as fdw_scan_tlist, but
>>> not for subqueries, and we need to create retrieved_attrs for the
>>> top-level query while deparsing the targetlist in
>>> deparseExplicitTargetList, but not for subqueries.  So, we should need
>>> some work to avoid such a useless overhead.

>> I think we can avoid the useless overhead by adding a new function,
>> deparseSubqueryTargetList, that deparses expressions in the given relation's
>> reltarget, not the tlist, as a SELECT clause of the subquery representing
>> the given relation.  That would also allow us to make the 1-to-1
>> relationship between the subquery output columns and their aliases more
>> explicit, which was your original comment.  Please find attached the new
>> version.  (The patch doesn't need the patch to avoid duplicate construction
>> of the tlist, discussed upthread.)

> I have not looked at the patch, but the reason we use a tlist instead
> of list of expressions is because fdw_scan_tlist is expected in that
> form

Actually, we wouldn't need that to deparse a remote join query; a list 
of expressions would be enough.

> and we don't want two different representations one to deparse
> SELECT and one to interpret results from the foreign server. What you
> describe above seems to introduce exactly that hazard.

I agree with you to some extent, BUT:
* I don't think it's a good idea to create a tlist for each base/join 
relation that is deparsed as a subquery, to just avoid that hazard.  As 
I said above, that's nothing but an overhead.
* I think we would need to have two different representations for at 
least base relations; we use fpinfo->attrs_used to deparse a simple 
foreign table scan query for a base relation, but when deparsing the 
base relation as a subquery, we would need to use the list of 
expressions in the base relation's reltarget, to deparse a SELECT clause 
of the subquery, because we need to deparse a whole-row reference to the 
base relation as ROW, not all the user columns expanded, as shown in 
this extracted from the regression tests in the patch:

+ EXPLAIN (VERBOSE, COSTS OFF)
+ SELECT t1.c1, ss.a, ss.b FROM (SELECT c1 FROM "S 1"."T 3" WHERE c1 = 
50) t1 INNER JOIN (SELECT t2.c1, t3.c1 FROM (SELECT c1 FROM ft4 WHERE c1 
between 50 and 60) t2 FULL JOIN (SELECT c1 FROM ft5 WHERE c1 between 50 
and 60) t3 ON (t2.c1 = t3.c1) WHERE t2.c1 IS NULL OR t2.c1 IS NOT NULL) 
ss(a, b) ON (TRUE) ORDER BY t1.c1, ss.a, ss.b FOR UPDATE OF t1;
+                                               QUERY PLAN 

+ 

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+  LockRows
+    Output: "T 3".c1, ft4.c1, ft5.c1, "T 3".ctid, ft4.*, ft5.*
+    ->  Nested Loop
+          Output: "T 3".c1, ft4.c1, ft5.c1, "T 3".ctid, ft4.*, ft5.*
+          ->  Foreign Scan
+                Output: ft4.c1, ft4.*, ft5.c1, ft5.*
+                Relations: (public.ft4) FULL JOIN (public.ft5)
+                Remote SQL: SELECT s8.c1, s8.c2, s9.c1, s9.c2 FROM 
((SELECT c1, ROW(c1, c2, c3) FROM "S 1"."T 3" WHERE ((c1 >= 50)) AND 
((c1 <= 60))) s8(c1, c2) FULL JOIN (SELECT c1, ROW(c1, c2, c3) FROM "S 
1"."T 4" WHERE ((c1 >= 50)) AND ((c1 <= 60))) s9(c1, c2) ON (((s8.c1 = 
s9.c1)))) WHERE (((s8.c1 IS NULL) OR (s8.c1 IS NOT NULL))) ORDER BY 
s8.c1 ASC NULLS LAST, s9.c1 ASC NULLS LAST
+                ->  Hash Full Join
+                      Output: ft4.c1, ft4.*, ft5.c1, ft5.*
+                      Hash Cond: (ft4.c1 = ft5.c1)
+                      Filter: ((ft4.c1 IS NULL) OR (ft4.c1 IS NOT NULL))
+                      ->  Foreign Scan on public.ft4
+                            Output: ft4.c1, ft4.*
+                            Remote SQL: SELECT c1, c2, c3 FROM "S 1"."T 
3" WHERE ((c1 >= 50)) AND ((c1 <= 60))
+                      ->  Hash
+                            Output: ft5.c1, ft5.*
+                            ->  Foreign Scan on public.ft5
+                                  Output: ft5.c1, ft5.*
+                                  Remote SQL: SELECT c1, c2, c3 FROM "S 
1"."T 4" WHERE ((c1 >= 50)) AND ((c1 <= 60))
+          ->  Materialize
+                Output: "T 3".c1, "T 3".ctid
+                ->  Seq Scan on "S 1"."T 3"
+                      Output: "T 3".c1, "T 3".ctid
+                      Filter: ("T 3".c1 = 50)
+ (25 rows)

>> Other changes:
>> * I went back to make_outerrel_subquery and make_innerrel_subquery, which
>> are flags to indicate whether to deparse the input relations as subqueries.
>> is_subquery_rel would work well for handling the cases of full joins with
>> restrictions on the input relations, but I noticed that that wouldn't work
>> well when extending to handle the cases where we deparse the input relations
>> as subqueries to evaluate PHVs remotely.

> I had objected to using a single variable instead of two previously
> and you had argued against it in [1]. There you had mentioned that PHV
> doesn't need two variables, but now you are taking the other stand,
> without any apparent reason. Can you please clarify it?

Sorry, I missed some cases.  Consider the join {A, B, C} satisfying the 
identity 3 in optimizer/README, ie,
    (A leftjoin B on (Pab)) leftjoin C on (Pbc)    = A leftjoin (B leftjoin C on (Pbc)) on (Pab)

where predicate Pbc fails for all-null B rows.  Assume that B has a PHV 
in the relation's reltarget.  While considering 2-way joins, 
foreign_join_ok would determine to deparse B as a subquery and hence set 
the B's is_subquery_rel to true for the join relation {A, B}.  However, 
if the planner selects the join order {A leftjoin (B leftjoin C on 
(Pbc)) on (Pab)} as the cheapest path for the join {A, B, C}, we would 
deparse B as a subquery according to the is_subquery_rel flag while 
creating the FROM clause entry for the join relation {B, C}.  That 
wouldn't look good.  That would be logically correct, though.  To avoid 
this, I'd like to go back to the two variables previously proposed.

Best regards,
Etsuro Fujita

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: [HACKERS] Push down more full joins in postgres_fdw