Re: tweaking NTUP_PER_BUCKET

From: Robert Haas
Subject: Re: tweaking NTUP_PER_BUCKET
Date:
Msg-id: CA+TgmoY-0L5iCnwC1xzrXTRqWc7VBQDtymxiCUp+-TugvLAoyg@mail.gmail.com
In reply to: Re: tweaking NTUP_PER_BUCKET  ("Tomas Vondra" <tv@fuzzy.cz>)
Responses: Re: tweaking NTUP_PER_BUCKET  (Tomas Vondra <tv@fuzzy.cz>)
List: pgsql-hackers
On Tue, Jul 8, 2014 at 12:06 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 8 July 2014, 16:16, Robert Haas wrote:
>> On Tue, Jul 8, 2014 at 9:35 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>>> Maybe. I'm not against setting NTUP_PER_BUCKET=1, but with large outer
>>> relations it may be way cheaper to use higher NTUP_PER_BUCKET values
>>> instead of increasing the number of batches (resulting in repeated scans
>>> of the outer table). I think it's important (but difficult) to keep
>>> these things somehow balanced.
>>
>> Right, I think that's clear.  I'm just pointing out that you get to
>> decide: you can either start with a larger NTUP_PER_BUCKET and then
>> reduce it if you have enough memory left, or you can start with a smaller
>> NTUP_PER_BUCKET and then increase it if you run short of memory.
>
> I don't think those two approaches are equal.
>
> With the approach I took, I can use a compromise value (NTUP=4) initially,
> and only resize the hash table once at the end (while keeping the amount
> of memory under work_mem all the time).
>
> With the "NTUP=1 and increase in case of memory pressure" you have to
> shrink the table immediately (to fit into work_mem), and if the hash table
> gets split into multiple batches you're suddenly in a situation with lower
> memory pressure and you may need to increase it again.
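
For reference, the quoted approach amounts to building the table with a
denser load factor (NTUP_PER_BUCKET = 4) and then enlarging the bucket
array exactly once, after the real tuple count is known, while staying
under work_mem the whole time.  A minimal sketch of that single resize
step, with made-up types and names rather than the actual patch:

#include <stdlib.h>

typedef struct HashTuple
{
    struct HashTuple *next;
    unsigned int      hashvalue;
    /* ... tuple data would follow here ... */
} HashTuple;

typedef struct HashTableSketch
{
    HashTuple **buckets;
    size_t      nbuckets;   /* always a power of two */
    size_t      ntuples;    /* tuples inserted so far */
} HashTableSketch;

/*
 * Re-chain every stored tuple into a larger bucket array; called once,
 * at the end of the build phase.  new_nbuckets must be a power of two.
 * Error handling omitted for brevity.
 */
static void
resize_buckets_once(HashTableSketch *ht, size_t new_nbuckets)
{
    HashTuple **new_buckets = calloc(new_nbuckets, sizeof(HashTuple *));
    size_t      i;

    for (i = 0; i < ht->nbuckets; i++)
    {
        HashTuple *tup = ht->buckets[i];

        while (tup != NULL)
        {
            HashTuple *next = tup->next;
            size_t     b = tup->hashvalue & (new_nbuckets - 1);

            tup->next = new_buckets[b];
            new_buckets[b] = tup;
            tup = next;
        }
    }

    free(ht->buckets);
    ht->buckets = new_buckets;
    ht->nbuckets = new_nbuckets;
}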

True.  On the other hand, this really only comes into play when the
estimates are wrong.  If you know at the start how many tuples you're
going to need to store and how big they are, you can determine whether
NTUP_PER_BUCKET=1 is going to work before you even start building the
hash table.  If it isn't, then you use fewer buckets right from the
start.  If we start by estimating a small value for NTUP_PER_BUCKET
and then let it float upward if we turn out to have more tuples than
expected, we're optimizing for the case where our statistics are
right.  If we start by estimating a larger value for NTUP_PER_BUCKET
than what we think we need to fit within work_mem, we're basically
guessing that our statistics are more likely to be wrong than to be
right.  I think.
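
To make the arithmetic concrete, the up-front decision described above
can be sketched as follows; the function and parameter names here are
hypothetical, not the code in PostgreSQL's nodeHash.c:

#include <stddef.h>

#define NTUP_PER_BUCKET_TARGET   1  /* preferred: one tuple per bucket */
#define NTUP_PER_BUCKET_FALLBACK 4  /* denser table when memory is tight */

static size_t
next_power_of_two(size_t n)
{
    size_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/*
 * Pick the number of hash buckets before building the table, based on
 * the planner's estimates of row count and row width and on the
 * work_mem budget (in bytes).
 */
static size_t
choose_nbuckets(double est_tuples, size_t est_tuple_size,
                size_t work_mem_bytes)
{
    size_t  tuple_space = (size_t) (est_tuples * est_tuple_size);
    size_t  nbuckets = next_power_of_two((size_t) est_tuples /
                                         NTUP_PER_BUCKET_TARGET);

    /* the bucket array costs one pointer per bucket */
    if (tuple_space + nbuckets * sizeof(void *) <= work_mem_bytes)
        return nbuckets;            /* NTUP_PER_BUCKET = 1 fits */

    /* otherwise use fewer buckets right from the start */
    return next_power_of_two((size_t) est_tuples /
                             NTUP_PER_BUCKET_FALLBACK);
}

If the estimates are right, this settles the bucket count once and
nothing needs to be resized later; the question in this subthread is
what to do when they turn out to be wrong.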

> I wonder if this is really worth the effort - my guess is it's efficient
> only if a large portion of the buckets is not visited (and thus does not need to
> be split) at all. Not sure how common that is (our workloads certainly are
> not like that).

Yeah.  It may be a bad idea.  I threw it out there as a possible way
of trying to mitigate the worst case, which is when you go to the trouble
of building the hash table and then make very few probes.  But that may not
be worth the complexity that this would introduce.
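
For completeness, the mitigation being discussed here is, roughly,
splitting a bucket only when a probe first lands in it, so buckets that
are never probed never pay the splitting cost.  A purely illustrative
sketch (made-up types and names, not code from any patch):

#include <stdlib.h>

typedef struct Tup
{
    struct Tup  *next;
    unsigned int hashvalue;
} Tup;

typedef struct LazyBucket
{
    Tup    *chain;  /* unsplit chain, as built */
    Tup   **sub;    /* finer sub-buckets, created on first probe */
    size_t  nsub;   /* power of two */
} LazyBucket;

/* On the first probe, redistribute the chain using additional hash bits. */
static void
lazy_split(LazyBucket *b, size_t nsub)
{
    Tup *tup = b->chain;

    b->sub = calloc(nsub, sizeof(Tup *));
    b->nsub = nsub;

    while (tup != NULL)
    {
        Tup    *next = tup->next;
        size_t  s = (tup->hashvalue >> 16) & (nsub - 1);

        tup->next = b->sub[s];
        b->sub[s] = tup;
        tup = next;
    }
    b->chain = NULL;
}

/* Return the (shorter) chain that a probe with this hash value must scan. */
static Tup *
lazy_probe(LazyBucket *b, unsigned int hashvalue)
{
    if (b->sub == NULL)
        lazy_split(b, 16);      /* arbitrary fan-out, for illustration */
    return b->sub[(hashvalue >> 16) & (b->nsub - 1)];
}

As the exchange above notes, this only wins when a large fraction of the
buckets is never probed at all.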

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


