Re: Performance improvement for joins where outer side is unique

Поиск
Список
Период
Сортировка
От Simon Riggs
Тема Re: Performance improvement for joins where outer side is unique
Дата
Msg-id CANP8+j+vxfEf3dqhnzKAnDopnnUM-Zjb90gNT=haEQNQePj9Mw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Performance improvement for joins where outer side is unique  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Ответы Re: Performance improvement for joins where outer side is unique  (David Rowley <david.rowley@2ndquadrant.com>)
Список pgsql-hackers
On 17 December 2015 at 00:17, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
 
    1) nodeHashjoin.c (and other join nodes)

    I've noticed we have this in the ExecHashJoin() method:

      /*
       * When the inner side is unique or we're performing a
       * semijoin, we'll consider returning the first match, but
       * after that we're done with this outer tuple.
       */
      if (node->js.unique_inner)
          node->hj_JoinState = HJ_NEED_NEW_OUTER;

    That seems a bit awkward because the comment speaks about unique
    joins *OR* semijoins, but the check only references unique_inner.
    That of course works because we do this in ExecInitHashJoin():

             switch (node->join.jointype)
             {
                     case JOIN_SEMI:
                             hjstate->js.unique_inner = true;
                             /* fall through */

    Either some semijoins may not be unique - in that case the naming is
    misleading at best, and we either need to find a better name or
    simply check two conditions like this:

      if (node->js.unique_inner || node->join.type == JOIN_SEMI)
         node->hj_JoinState = HJ_NEED_NEW_OUTER;

    Or all semijoins are unique joins, and in that case the comment may
    need rephrasing. But more importantly, it begs the question why
    we're detecting this in the executor and not in the planner? Because
    if we detect it in executor, we can't use this bit in costing, for
    example.


The reason not to detect that in the planner is simply that unique_join
is meant to mean that the relation on the inner side of the join will at
most contain a single Tuple which matches each outer relation tuple,
based on the join condition. I've added no detection for this in semi
joins, and I'd rather not go blindly setting the flag to true in the
planner as it simply may not be true for the semi join. At the moment
that might not matter as we're only using the unique_join flag as an
optimization in the join nodes, but I'd prefer not to do this as its
likely we'll want to do more with this flag later, and I'd rather we
keep the meaning well defined. You might argue that this is also true
for semi joins, but if down the road somewhere we want to perform some
checks on the inner relation before the join takes place, and in that
case the Tuples of the relation might not have the same properties we
claim they do.

But you're right that reusing the flag in the join nodes is not ideal,
and the comment is not that great either. I'd really rather go down the
path of either renaming the variable, or explaining this better in the
comment. It seems unnecessary to check both for each tuple being joined.
I'd imagine that might add a little overhead to joins which are not unique.

I'd be very surprised it that had any measurable impact.

How about changing the comment to:

/*
* In the event that the inner side has been detected to be
* unique, as an optimization we can skip searching for any
* subsequent matching inner tuples, as none will exist.
* For semijoins unique_inner will always be true, although
* in this case we don't match another inner tuple as this
* is the required semi join behavior.
*/

Alternatively or additionally we can rename the variable in the executor
state, although I've not thought of a name yet that I don't find overly
verbose: unique_inner_or_semi_join,  match_first_tuple_only.

I'd go with match_first_tuple_only.

+1

unique_inner is a state that has been detected, match_first_tuple_only is the action we take as a result. 

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Kyotaro HORIGUCHI
Дата:
Сообщение: Re: Making tab-complete.c easier to maintain
Следующее
От: Michael Paquier
Дата:
Сообщение: Re: Freeze avoidance of very large table.