Thread: Enable using IS NOT DISTINCT FROM in hash and merge joins

Enable using IS NOT DISTINCT FROM in hash and merge joins

From
Chi Gao
Date:

Hello,

We are using PostgreSQL to execute some SQL scripts “auto translated” from Hive QL, where the join operator “<=>” is heavily used. The semantically equivalent operator in PostgreSQL is “IS NOT DISTINCT FROM”.
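
To make the semantic difference concrete, here is a minimal C sketch that models a nullable integer and contrasts ordinary “=” with “IS NOT DISTINCT FROM”. The NullableInt and TriBool types and both function names are hypothetical and exist only for this illustration; they are not PostgreSQL internals.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical nullable integer: a value plus an is-null flag. */
typedef struct {
    int32_t value;
    bool    isnull;
} NullableInt;

/* Three-valued result of the ordinary "=" operator. */
typedef enum { TV_FALSE, TV_TRUE, TV_UNKNOWN } TriBool;

/* Ordinary "=": a NULL on either side makes the result unknown (NULL). */
static TriBool sql_equals(NullableInt a, NullableInt b)
{
    if (a.isnull || b.isnull)
        return TV_UNKNOWN;
    return (a.value == b.value) ? TV_TRUE : TV_FALSE;
}

/* "IS NOT DISTINCT FROM": never unknown; two NULLs compare as equal. */
static bool not_distinct(NullableInt a, NullableInt b)
{
    if (a.isnull || b.isnull)
        return a.isnull && b.isnull;
    return a.value == b.value;
}
```

With two NULL inputs, sql_equals yields TV_UNKNOWN, so a join on “=” drops the pair, while not_distinct yields true and keeps it.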

 

However, we found that when “IS NOT DISTINCT FROM” is used in a join condition, only a nested loop plan can be generated, as confirmed here https://www.postgresql.org/message-id/13950.1511879733%40sss.pgh.pa.us and here https://postgrespro.com/list/thread-id/2059856 .

In another discussion, someone suggested using coalesce(…) to replace NULLs with some special value, but, as in that thread, we have no reliable way to pick a special value that is safe for an arbitrary expression.
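
The following C sketch illustrates why a sentinel is unsafe: if the sentinel can also occur as a real value, the coalesce-based comparison reports a match where “IS NOT DISTINCT FROM” would not. The types, function names, and the sentinel value -1 are assumptions made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical nullable integer: a value plus an is-null flag. */
typedef struct {
    int32_t value;
    bool    isnull;
} NullableInt;

/* Models coalesce(x, sentinel): substitute the sentinel for NULL. */
static int32_t coalesce_i32(NullableInt x, int32_t sentinel)
{
    return x.isnull ? sentinel : x.value;
}

/* Models the workaround: coalesce(a, s) = coalesce(b, s). */
static bool coalesce_equals(NullableInt a, NullableInt b, int32_t sentinel)
{
    return coalesce_i32(a, sentinel) == coalesce_i32(b, sentinel);
}

/* Reference semantics of "IS NOT DISTINCT FROM". */
static bool not_distinct(NullableInt a, NullableInt b)
{
    if (a.isnull || b.isnull)
        return a.isnull && b.isnull;
    return a.value == b.value;
}
```

For a NULL on one side and a real -1 on the other, coalesce_equals with sentinel -1 returns true although not_distinct returns false, so the workaround is only correct when the sentinel is known never to appear in the data.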

 

So I hacked the PG10 code to support using “IS NOT DISTINCT FROM” in hash and merge joins (not touching the indexes). It works in our environment, but I would like to know whether my approach makes sense or is going to cause damage.

There are 6 kinds of changes, and to be honest, I am not confident that any of them is done the correct way, so please advise:

- First, I reversed the meaning of DistinctExpr from “IS DISTINCT FROM” to “IS NOT DISTINCT FROM”. This is simpler to process in joins, because “IS NOT DISTINCT FROM” is closer to “=”. (backend/parser/parse_expr.c, backend/utils/adt/ruleutils.c)

- The execution logic of DistinctExpr already inverts the result internally. Because the change above cancels that inversion out, I reverted it. (backend/executor/execExprInterp.c, backend/optimizer/path/clausesel.c)

- In hash joins, I need to tell the executor that “NULL matches NULL” when the operator is “IS NOT DISTINCT FROM”. I could not figure out the best way to pass this information down, so I just ORed 0x8000000 into the entries of the operator Oid List. As no other code does this, please advise a better approach: should I add a Bitmapset to pass the flags, or define a new Node type that includes both the Oid and a bool flag? (backend/executor/nodeHashjoin.c, backend/executor/nodeHash.c)

- To support merge join, I added a nulleqnull bool flag to SortSupportData to bypass the “stop merging early when NULLs are reached” logic when the join operator is a DistinctExpr. I think there is a padding gap after “bool            abbreviate;”, so I added the new flag there, to keep binary compatibility in case something depends on it… (backend/executor/nodeMergejoin.c, include/utils/sortsupport.h)

- In the create_join_clause, reconsider_outer_join_clause, and reconsider_full_join_clause functions, the derived expression generated by the call to build_implied_join_equality comes out as an OpExpr even for a DistinctExpr. Because the two have the same definition, I just patch the resulting node back to DistinctExpr if the input is a DistinctExpr. (backend/optimizer/path/equivclass.c)

- All other changes are in code paths that previously allowed only OpExpr; I added logic to allow DistinctExpr too.
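
To illustrate the question raised in the hash join item above, here is a small C sketch that compares packing the flag into the Oid itself with carrying it in a separate struct. Only the constant 0x8000000 comes from the message; the Oid typedef mirrors PostgreSQL's unsigned 32-bit Oid, while NULL_EQ_NULL_FLAG, the helper functions, and the JoinOperator struct are hypothetical names used only here. The operator value 96 is an arbitrary example.

```c
#include <stdbool.h>
#include <stdint.h>

/* PostgreSQL's Oid is an unsigned 32-bit integer; redefined here so the
 * sketch compiles on its own. */
typedef uint32_t Oid;

/* Flag value as quoted in the message; the macro name is hypothetical. */
#define NULL_EQ_NULL_FLAG ((Oid) 0x8000000)

/* Option 1: pack the flag into the Oid (the approach described above). */
static Oid tag_operator(Oid opno, bool null_eq_null)
{
    return null_eq_null ? (opno | NULL_EQ_NULL_FLAG) : opno;
}

static Oid tagged_opno(Oid tagged)
{
    return tagged & ~NULL_EQ_NULL_FLAG;
}

static bool tagged_null_eq_null(Oid tagged)
{
    return (tagged & NULL_EQ_NULL_FLAG) != 0;
}

/* Option 2: carry the flag next to the Oid in a hypothetical struct,
 * roughly what a dedicated Node type would do. */
typedef struct {
    Oid  opno;
    bool null_eq_null;
} JoinOperator;
```

In this sketch the packed form is ambiguous whenever a genuine operator Oid already has that bit set: tag_operator(0x8000001, false) is read back as operator 1 with the flag on. The struct form has no such collision, which is one argument for a separate flag or Node type.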

 

The attached patch is based on commit 821200405cc3f25fda28c5f58d17d640e25559b8.

Thanks!

Gao, Chi

Beijing Microfun Co. Ltd.

Attachments

Re: Enable using IS NOT DISTINCT FROM in hash and merge joins

From
Alexander Kuzmenkov
Date:

Hi,

I heard from my colleagues that we have an extension that does a similar
thing. I'm attaching the source. It creates an "==" operator for some data
types that works like "IS NOT DISTINCT FROM". You can then perform hash
joins on this operator. Apparently the hash join machinery supports
non-strict operators, but I'm not sure about the hash indexes.
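
As a rough illustration of what supporting a non-strict operator means on the hashing side, the C sketch below gives every NULL key one fixed hash value when the operator is non-strict, and skips NULL keys when it is strict. This is a simplified model, not the code in nodeHash.c; the NullableInt type, the hash_key function, and the multiplicative hash are assumptions for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical nullable integer: a value plus an is-null flag. */
typedef struct {
    int32_t value;
    bool    isnull;
} NullableInt;

/*
 * Compute a hash for a join key.
 * Returns false when the tuple can be skipped because its key is NULL and
 * the join operator is strict (a NULL can never match). For a non-strict
 * operator, all NULL keys get the same hash so they land in one bucket,
 * where the operator itself decides whether they match.
 */
static bool hash_key(NullableInt key, bool op_strict, uint32_t *hash_out)
{
    if (key.isnull) {
        if (op_strict)
            return false;
        *hash_out = 0;
        return true;
    }
    /* Simple multiplicative hash, chosen only for illustration. */
    *hash_out = (uint32_t) key.value * 2654435761u;
    return true;
}
```

Under this model a custom non-strict "==" operator needs no special executor support beyond keeping NULL-keyed tuples in the hash table.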

I think the question we have to answer is what prevents us from having 
hash and merge joins on non-strict operators? Hash joins seem to work 
already, so we can just create a custom operator and use it. Merge join 
executor can be adapted to work as well, but the planner would require 
more complex changes. Just adding a check for DistinctExpr to 
check_mergejoinable probably breaks equivalence classes. The problem 
with merge join planning is that it has the notion of "mergejoinable 
operator", which is a strict btree equality operator, and it uses such 
operators both to perform merge joins and to conclude that some two 
variables must be equal (that is, create equivalence classes). If we are 
going to perform merge joins on some other kinds of operators, we have 
to disentangle these two uses. I had to do this to support merge joins 
on inequality clauses; you can take a look at this thread if you wish: 
https://www.postgresql.org/message-id/flat/b31e1a2d-5ed2-cbca-649e-136f1a7c4c31%40postgrespro.ru 
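
For the executor side of merge joins, the following C sketch shows the kind of adaptation being discussed: with a strict operator the merge can stop as soon as either input reaches its NULLs, while a "NULL equals NULL" flag lets the NULL runs be joined as well. It is a toy model over sorted arrays, assuming ascending order with NULLs last; the types and function names are hypothetical and do not correspond to nodeMergejoin.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical nullable integer: a value plus an is-null flag. */
typedef struct {
    int32_t value;
    bool    isnull;
} NullableInt;

/* Three-way comparison with NULLs sorting after all non-NULL values. */
static int cmp_nulls_last(NullableInt a, NullableInt b)
{
    if (a.isnull && b.isnull) return 0;
    if (a.isnull) return 1;
    if (b.isnull) return -1;
    return (a.value > b.value) - (a.value < b.value);
}

/* Count joined pairs; both inputs must be sorted ascending, NULLs last. */
static size_t merge_join_count(const NullableInt *outer, size_t n_outer,
                               const NullableInt *inner, size_t n_inner,
                               bool null_eq_null)
{
    size_t i = 0, j = 0, matches = 0;

    while (i < n_outer && j < n_inner) {
        /* Strict operator: once either side reaches NULL, nothing that
         * follows can match, so stop early. */
        if (!null_eq_null && (outer[i].isnull || inner[j].isnull))
            break;

        int c = cmp_nulls_last(outer[i], inner[j]);
        if (c < 0) {
            i++;
        } else if (c > 0) {
            j++;
        } else {
            /* Find the run of equal keys on each side and join them. */
            size_t i_end = i, j_end = j;
            while (i_end < n_outer && cmp_nulls_last(outer[i_end], outer[i]) == 0)
                i_end++;
            while (j_end < n_inner && cmp_nulls_last(inner[j_end], inner[j]) == 0)
                j_end++;
            matches += (i_end - i) * (j_end - j);
            i = i_end;
            j = j_end;
        }
    }
    return matches;
}
```

This covers only the executor behavior; it says nothing about the planner problem described above, where mergejoinable operators are also used to build equivalence classes.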


-- 
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachments