Re: Performance improvement for joins where outer side is unique

Поиск
Список
Период
Сортировка
От David Rowley
Тема Re: Performance improvement for joins where outer side is unique
Дата
Msg-id CAKJS1f_i7dGh9ytc0DuaX3JJaN7Gp9AGXNbdtWjj0PkPzNzUeg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Performance improvement for joins where outer side is unique  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Ответы Re: Performance improvement for joins where outer side is unique  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Список pgsql-hackers
On 17 December 2015 at 05:02, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
0) I know the patch does not tweak costing - any plans in this
   direction? Would it be possible to simply use the costing used by
   semijoin?


Many thanks for looking at this.

The patch does tweak the costings so that unique joins are costed in the same way as semi joins. It's a very small change, so easy to miss.

For example, see the change in initial_cost_nestloop()  

-       if (jointype == JOIN_SEMI || jointype == JOIN_ANTI)
+       if (jointype == JOIN_SEMI ||
+               jointype == JOIN_ANTI ||
+               unique_inner)
        {

Also see the changes in final_cost_nestloop() and final_cost_hashjoin()


1) nodeHashjoin.c (and other join nodes)

I've noticed we have this in the ExecHashJoin() method:

 /*
  * When the inner side is unique or we're performing a
  * semijoin, we'll consider returning the first match, but
  * after that we're done with this outer tuple.
  */
 if (node->js.unique_inner)
     node->hj_JoinState = HJ_NEED_NEW_OUTER;

That seems a bit awkward because the comment speaks about unique joins *OR* semijoins, but the check only references unique_inner. That of course works because we do this in ExecInitHashJoin():

        switch (node->join.jointype)
        {
                case JOIN_SEMI:
                        hjstate->js.unique_inner = true;
                        /* fall through */

Either some semijoins may not be unique - in that case the naming is misleading at best, and we either need to find a better name or simply check two conditions like this:

 if (node->js.unique_inner || node->join.type == JOIN_SEMI)
    node->hj_JoinState = HJ_NEED_NEW_OUTER;

Or all semijoins are unique joins, and in that case the comment may need rephrasing. But more importantly, it begs the question why we're detecting this in the executor and not in the planner? Because if we detect it in executor, we can't use this bit in costing, for example.


The reason not to detect that in the planner is simply that unique_join is meant to mean that the relation on the inner side of the join will at most contain a single Tuple which matches each outer relation tuple, based on the join condition. I've added no detection for this in semi joins, and I'd rather not go blindly setting the flag to true in the planner as it simply may not be true for the semi join. At the moment that might not matter as we're only using the unique_join flag as an optimization in the join nodes, but I'd prefer not to do this as its likely we'll want to do more with this flag later, and I'd rather we keep the meaning well defined. You might argue that this is also true for semi joins, but if down the road somewhere we want to perform some checks on the inner relation before the join takes place, and in that case the Tuples of the relation might not have the same properties we claim they do.

But you're right that reusing the flag in the join nodes is not ideal, and the comment is not that great either. I'd really rather go down the path of either renaming the variable, or explaining this better in the comment. It seems unnecessary to check both for each tuple being joined. I'd imagine that might add a little overhead to joins which are not unique. 
 
How about changing the comment to:

/*
* In the event that the inner side has been detected to be
* unique, as an optimization we can skip searching for any
* subsequent matching inner tuples, as none will exist.
* For semijoins unique_inner will always be true, although
* in this case we don't match another inner tuple as this
* is the required semi join behavior.
*/

Alternatively or additionally we can rename the variable in the executor state, although I've not thought of a name yet that I don't find overly verbose: unique_inner_or_semi_join,  match_first_tuple_only.


2) analyzejoins.c

I see that this code in join_is_removable()

    innerrel = find_base_rel(root, innerrelid);

    if (innerrel->reloptkind != RELOPT_BASEREL)
        return false;

was rewritten like this:

    innerrel = find_base_rel(root, innerrelid);

    Assert(innerrel->reloptkind == RELOPT_BASEREL);

That suggests that previously the function expected cases where reloptkind may not be RELOPT_BASEREL, but now it'll error out int such cases. I haven't noticed any changes in the surrounding code that'd guarantee this won't happen, but I also haven't been able to come up with an example triggering the assert (haven't been trying too hard). How do we know the assert() makes sense?


I'd have changed this as this should be covered by the  if (!sjinfo->is_unique_join || a few lines up. mark_unique_joins() only sets is_unique_join to true if specialjoin_is_unique_join() returns true and that function tests reloptkind to ensure its set to RELOPT_BASEREL and return false if the relation is not. Perhaps what is missing is a comment to explain the function should only be used on RELOPT_BASEREL type relations.

 

3) joinpath.c

- either "were" or "been" seems redundant

/* left joins were already been analyzed for uniqueness in mark_unique_joins() */


Good catch. Fixed 
 

4) analyzejoins.c

comment format broken

/*
 * mark_unique_joins
        Analyze joins in order to determine if their inner side is unique based
        on the join condition.
 */


Fixed 
 
5) analyzejoins.c

missing line before the comment

}
/*
 * rel_supports_distinctness
 *              Returns true if rel has some properties which can prove the relation
 *              to be unique over some set of columns.
 *

Fixed 

I've attached updated patches.

Thanks again for the review!

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Simon Riggs
Дата:
Сообщение: Re: Reducing ClogControlLock contention
Следующее
От: Robert Haas
Дата:
Сообщение: Re: Remove array_nulls?