Re: MergeJoin beats HashJoin in the case of multiple hash clauses

Поиск

Список

Период

Сортировка

От	Andy Fan
Тема	Re: MergeJoin beats HashJoin in the case of multiple hash clauses
Дата	11 сентября 2023 г. 07:51:03
Msg-id	CAKU4AWoKiogNir=ESrm6F9pUwYFDfmsoNG9w9k1MG3VUbKM0Dw@mail.gmail.com обсуждение исходный текст
Ответ на	MergeJoin beats HashJoin in the case of multiple hash clauses (Andrey Lepikhov <a.lepikhov@postgrespro.ru>)
Ответы	Re: MergeJoin beats HashJoin in the case of multiple hash clauses
Список	pgsql-hackers

Дерево обсуждения

Hi,

On Thu, Jun 15, 2023 at 4:30 PM Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote:

Hi, all.

Some of my clients use JOIN's with three - four clauses. Quite
frequently, I see complaints on unreasonable switch of JOIN algorithm to
Merge Join instead of Hash Join. Quick research have shown one weak
place - estimation of an average bucket size in final_cost_hashjoin (see
q2.sql in attachment) with very conservative strategy.
Unlike estimation of groups, here we use smallest ndistinct value across
all buckets instead of multiplying them (or trying to make multivariate
analysis).
It works fine for the case of one clause. But if we have many clauses,
and if each has high value of ndistinct, we will overestimate average
size of a bucket and, as a result, prefer to use Merge Join. As the
example in attachment shows, it leads to worse plan than possible,
sometimes drastically worse.
I assume, this is done with fear of functional dependencies between hash
clause components. But as for me, here we should go the same way, as
estimation of groups.

I can reproduce the visitation you want to improve and verify the patch

can do it expectedly. I think this is a right thing to do.

The attached patch shows a sketch of the solution.

I understand that this is a sketch of the solution, but the below changes still

make me confused.

+ if (innerbucketsize > virtualbuckets)

+ innerbucketsize = 1.0 / virtualbuckets;

innerbucketsize is a fraction of rows in all the rows, so it is between 0.0 and 1.0.

and virtualbuckets is the number of buckets in total (when considered the mutli

batchs), how is it possible for 'innerbucketsize > virtualbuckets' ? Am

I missing something?

Best Regards

Andy Fan

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Pavel Stehule
Дата: 11 сентября 2023 г., 07:34:56
Сообщение: Re: proposal: possibility to read dumped table's name from file

Следующее

От: Pavel Stehule
Дата: 11 сентября 2023 г., 07:57:12
Сообщение: Re: proposal: possibility to read dumped table's name from file

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: MergeJoin beats HashJoin in the case of multiple hash clauses

Предыдущее

Следующее