Re: [GENERAL] Perfomance of IN-clause with many elements and possible solutions

Поиск

Список

Период

Сортировка

От	David G. Johnston
Тема	Re: [GENERAL] Perfomance of IN-clause with many elements and possible solutions
Дата	25 июля 2017 г. 08:09:07
Msg-id	CAKFQuwbR+SyKU_-Qicc_uTQW6gJhwFF+TOco-hw9NPMcKQ9X1g@mail.gmail.com обсуждение исходный текст
Ответ на	Re: [GENERAL] Perfomance of IN-clause with many elements and possible solutions (Tom Lane <tgl@sss.pgh.pa.us>)
Список	pgsql-general

Дерево обсуждения

On Mon, Jul 24, 2017 at 8:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

[*docs]
If the data were perfectly distributed, with the same
* number of tuples going into each available bucket, then the bucketsize
* fraction would be 1/nbuckets. But this happy state of affairs will occur
* only if (a) there are at least nbuckets distinct data values, and (b)
* we have a not-too-skewed data distribution. Otherwise the buckets will
* be nonuniformly occupied.

Thanks, I have a better feel now. Using this example (200 inner relation rows) is pretty poor since at this scale there doesn't seem to be enough data to make a noticeable difference.

But anyway, the above comment is only being applied when dealing with a non-unique inner relation; however, the fraction used is 1/nbuckets for any unique relation regardless of its size.

if (IsA(inner_path, UniquePath))

innerbucketsize = 1.0 / virtualbuckets;

else

And to clarify for others only reading this...the 200 on the "VALUES" node is there because there are 200 literal values in the value_list. The 200 on the resulting Hash (and HashAggregate in the example) node is there because of DEFAULT_NUM_DISTINCT (changing the query limit to 300 only changed the former). Further, since it is only the default, the fraction used charged out is 1/10 instead of 1/200 that would used if the 200 were a real number instead - or 1/1024 if those 200 rows were known to be themselves unique.

For me, I'm seeing that the expected number of input rows doesn't factor into the innerbucketsize computation directly (possibly excepting a scaling factor adjustment).

I can understand better, now, why this seemingly perfect example of a semi-join query gets executed with an extra distinct/grouping node.

David J.

В списке pgsql-general по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: [GENERAL] Perfomance of IN-clause with many elements and possible solutions