Re: HashJoin w/option to unique-ify inner rel

Поиск

Список

Период

Сортировка

От	Lawrence, Ramon
Тема	Re: HashJoin w/option to unique-ify inner rel
Дата	16 апреля 2009 г. 22:22:02
Msg-id	6EEA43D22289484890D119821101B1DF05190E43@exchange20.mercury.ad.ubc.ca обсуждение исходный текст
Ответ на	HashJoin w/option to unique-ify inner rel (Robert Haas <robertmhaas@gmail.com>)
Ответы	Re: HashJoin w/option to unique-ify inner rel (Robert Haas <robertmhaas@gmail.com>)
Список	pgsql-hackers

Дерево обсуждения

> Upon further review, it appears that a big part of this problem is
> that cost_hashjoin() doesn't understand that it needs cost semi-joins
> differently from inner or left joins.  The bogus logic looks to be
> right here:
>     startup_cost += hash_qual_cost.startup;
>     run_cost += hash_qual_cost.per_tuple *
>         outer_path_rows * clamp_row_est(inner_path_rows *
> innerbucketsize) * 0.5;
>
> Of course, when the join type is JOIN_SEMI, we're going to stop
> looking after we find the first match, so this estimate is really far
> off.

The 8.3 version of cost_hashjoin() had a line like this:

joininfactor = join_in_selectivity(&path->jpath, root);

and a cost function like this:

run_cost += hash_qual_cost.per_tuple *    outer_path_rows * clamp_row_est(inner_path_rows *
innerbucketsize) * joininfactor * 0.5;

This compensated for IN joins being able to stop scanning a bucket once
a match is found.  You may consider something similar for a semi-join.
Having experimented with a lot of this code recently, there is some
potential for improvement on the bucket sizes, etc., but it is a
non-trivial problem.

I tested a similar query on TPCH 100M1Z on version 8.3:

select * from customer where c_custkey in (select o_custkey from orders)

and found that hash aggregate was marginally faster.  If you turn off
aggregation, it selects an IN hash join which is about 5% slower and the
planner is not too far off.  So, it would be definitely possible to
modify the cost function appropriately.

> >>> It's tempting to have Hash cheat and just peek at the node beneath
it
> >>> to see if it's a HashAggregate, in which case it could call a
special
> >>> method to request the whole hash. But it would have to know that
it's
> >>> just a plain uniquify and not implementing a GROUP BY.

If HashAggregate is faster, then the question is can you make it better
by avoiding building the hash structure twice.  I haven't considered all
the possibilities, but the situation you have used as an example, an IN
query, seems workable.  Instead of translating to a hash
aggregate/hash/hash join query plan, it may be possible to create a
special hash join node that does uniquefy.

The benefit is that the planner knows about it (instead of changing the
execution plan), you can be more accurate on costs for the hash join,
and you can optimize by using only one hash table construction. A
challenge that must be dealt with is handling the multi-batch case.  It
appears that hash aggregate does not currently handle this, but I may be
mistaken.

--
Ramon Lawrence

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Tiramisu Mokka
Дата: 16 апреля 2009 г., 20:39:15
Сообщение: postgres 8.4 beta1

Следующее

От: Tom Lane
Дата: 16 апреля 2009 г., 23:27:03
Сообщение: Re: HashJoin w/option to unique-ify inner rel

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: HashJoin w/option to unique-ify inner rel

Предыдущее

Следующее