Hash cost

From: Dennis Haney
Subject: Hash cost
Date:
Msg-id: 406970CD.2060808@diku.dk
List: pgsql-hackers
Hi

Could someone please try to explain why the cost estimator for the hash
join is implemented as it is? (cost_hashjoin in costsize.c)
In particular, these issues:

First, there is the estimation of the number of rows and their size.
ExecChooseHashTableSize() apparently trusts neither and doubles both.
Thus the function estimates the input relation to be 4 times larger
than the rest of the optimizer thinks. Why is that?
And why is this doubling also applied to the size of HashJoinTupleData,
but not applied twice to the estimated number of bytes the hash table
would use, a number that feeds the calculation of the number of batches?
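
To make the doubling concrete, here is a minimal standalone sketch of
the sizing arithmetic as I read it (paraphrased from the behaviour
described above, not a verbatim quote of ExecChooseHashTableSize();
the numbers, and the 16-byte overhead standing in for
sizeof(HashJoinTupleData), are made up for illustration):

/* sketch.c: the 4x effect of doubling both rows and tuple size */
#include <stdio.h>

int main(void)
{
    double ntuples  = 100000;  /* planner's row estimate */
    int    tupwidth = 40;      /* planner's width estimate */

    double rows    = 2.0 * ntuples;        /* doubled row count */
    int    tupsize = 2 * (tupwidth + 16);  /* doubled size, incl.
                                            * assumed overhead */

    double inner_rel_bytes = rows * tupsize;  /* 4x the plain estimate */

    printf("assumed hash input: %.0f bytes (vs. %.0f estimated)\n",
           inner_rel_bytes, ntuples * (double) (tupwidth + 16));
    return 0;
}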

Second, why does the optimizer first guess that there can be 10 values
in a bucket, and then afterwards spend a lot of time estimating that
number for use in another calculation, using numbers that were
themselves based on the guess of 10 values per bucket?
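
A small sketch of the circularity I mean (the constant 10 mirrors the
hard-wired tuples-per-bucket guess; everything else is illustrative):

/* circular.c: bucket count sized from a guess of 10 tuples/bucket,
 * then the "estimated" average chain length re-derived from it */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double ntuples = 100000;      /* estimated inner rows */
    int    ntup_per_bucket = 10;  /* the hard-wired guess */

    int nbuckets = (int) ceil(ntuples / ntup_per_bucket);

    /* the later estimate of the average chain length is computed
     * against this nbuckets, so it largely reproduces the guess: */
    double avg_chain = ntuples / nbuckets;

    printf("nbuckets = %d, derived avg chain = %.1f\n",
           nbuckets, avg_chain);
    return 0;
}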

Third, why does the calculation assume that the most common value is
dominant by far, while the other common values mean nothing?
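
To illustrate why the other common values should matter, compare a
skew factor computed from the top value alone against one summed over
all common values (the frequencies are invented, and I assume the same
value distribution on both sides of the join):

/* mcv.c: top-MCV-only vs. all-MCV estimate of matched work */
#include <stdio.h>

int main(void)
{
    /* hypothetical most-common-value frequencies */
    double mcv[] = { 0.10, 0.09, 0.08, 0.08, 0.07 };
    int    nmcv  = 5;

    /* chance a probe carries value i, times chance an inner tuple
     * carries it too: mcv[i]^2, summed over the common values */
    double all = 0.0;
    for (int i = 0; i < nmcv; i++)
        all += mcv[i] * mcv[i];

    printf("top-value-only factor:    %.4f\n", mcv[0] * mcv[0]);
    printf("all-common-values factor: %.4f\n", all);
    return 0;
}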

Fourth, why is it assumed that the hash function creates no collisions
between non-identical values, and that multiple join qualifiers do not
affect this either?
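
Even a perfectly uniform hash function collides distinct values once
they outnumber the buckets; a birthday-style bound makes that easy to
see (numbers invented):

/* collide.c: expected collisions for n distinct values in k buckets */
#include <stdio.h>

int main(void)
{
    double n = 100000;  /* distinct join-key values */
    double k = 10000;   /* buckets */

    double per_bucket = n / k;             /* distinct values/bucket */
    double pairs = n * (n - 1) / 2.0 / k;  /* expected colliding pairs:
                                            * C(n,2) * P(same bucket) */

    printf("avg distinct values per bucket: %.1f\n", per_bucket);
    printf("expected colliding pairs: %.0f\n", pairs);
    return 0;
}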

Fifth, why is it assumed that a probe most often looks in a chain of
average length? I would assume that a lot more time is spent looking
in the longest chains...
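
The point can be made with a toy distribution: assuming probe keys
follow the build-side distribution, a probe lands in a chain with
probability proportional to that chain's length, so the expected probe
cost is E[len^2]/E[len], not the plain average E[len]. The chain
lengths below are invented to show the gap:

/* chains.c: average chain length vs. what a random probe sees */
#include <stdio.h>

int main(void)
{
    int chains[] = { 1, 1, 1, 1, 1, 1, 1, 1, 2, 90 };  /* skewed */
    int nchains = 10;

    double total = 0, sq = 0;
    for (int i = 0; i < nchains; i++)
    {
        total += chains[i];
        sq    += (double) chains[i] * chains[i];
    }

    printf("average chain length:        %.1f\n", total / nchains);
    printf("probe-weighted chain length: %.1f\n", sq / total);
    return 0;
}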


-- 
Dennis


