Re: Group by more efficient than distinct?

Поиск

Список

Период

Сортировка

От	Mark Mielke
Тема	Re: Group by more efficient than distinct?
Дата	21 апреля 2008 г. 22:39:21
Msg-id	480D41C3.3070608@mark.mielke.cc обсуждение исходный текст
Ответ на	Re: Group by more efficient than distinct? (Mark Mielke <mark@mark.mielke.cc>)
Список	pgsql-performance

Дерево обсуждения

Mark Mielke wrote:
> PFC wrote:
>>     Actually, the memory used by the hash depends on the number of
>> distinct values, not the number of rows which are processed...
>>     Consider :
>>
>> SELECT a GROUP BY a
>> SELECT a,count(*) GROUP BY a
>>
>>     In both cases the hash only holds discinct values. So if you have
>> 1 million rows to process but only 10 distinct values of "a", the
>> hash will only contain those 10 values (and the counts), so it will
>> be very small and fast, it will absorb a huge seq scan without
>> problem. If however, you have (say) 100 million distinct values for
>> a, using a hash would be a bad idea. As usual, divide the size of
>> your RAM by the number of concurrent connections or something.
>>     Note that "a" could be a column, several columns, anything, the
>> size of the hash will be proportional to the number of distinct
>> values, ie. the number of rows returned by the query, not the number
>> of rows processed (read) by the query. Same with hash joins etc,
>> that's why when you join a very small table to a large one Postgres
>> likes to use seq scan + hash join on the small table.
>
> This surprises me - hash values are lossy, so it must still need to
> confirm against the real list of values, which at a minimum should
> require references to the rows to check against?
>
> Is PostgreSQL doing something beyond my imagination? :-)

Hmmm... You did say distinct values, so I can see how that would work
for distinct. What about seq scan + hash join, though? To complete the
join, wouldn't it need to have a reference to each of the rows to join
against? If there is 20 distinct values and 200 rows in the small table
- wouldn't it need 200 references to be stored?

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

В списке pgsql-performance по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Group by more efficient than distinct?