Re: Explanation for bug #13908: hash joins are badly broken

From: Tomas Vondra
Subject: Re: Explanation for bug #13908: hash joins are badly broken
Date:
Msg-id: 56B64AAF.4060008@2ndquadrant.com
In response to: Re: Explanation for bug #13908: hash joins are badly broken  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Explanation for bug #13908: hash joins are badly broken  (Andres Freund <andres@anarazel.de>)
Re: Explanation for bug #13908: hash joins are badly broken  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On 02/06/2016 06:47 PM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On 02/06/2016 02:34 AM, Tom Lane wrote:
>>> I have sussed what's happening in bug #13908. Basically, commit
>>> 45f6240a8fa9d355 ("Pack tuples in a hash join batch densely, to save
>>> memory") broke things for the case where a hash join is using a skew
>>> table.
>
>> Damn, that's an embarrassing oversight :-/
>
>> I believe the attached patch should fix this by actually copying
>> the tuples into the densely allocated chunks. Haven't tested it
>> though, will do in a few hours.
>
> Yeah, that's one fix approach I was contemplating last night. (I
> think the patch as written leaks memory and doesn't account for
> space usage properly either, but certainly this is a direction we
> could take.)

Yes, it definitely needs more work (to free the original tuple copy 
after moving it into the dense_alloc chunk).

> The other answer I was thinking about was to get rid of the
> assumption that iterating over the chunk storage is a valid thing to
> do, and instead scan the hashbucket chains when we need to visit all
> tuples. I really do not like the patch as designed, for several
> reasons:
>
> * It incorporates a bespoke reimplementation of palloc into hash
> joins. This is not a maintainable/scalable way to go about reducing
> memory consumption. It should have been done with an arm's-length API
> to a new type of memory context, IMO (probably one that supports
> palloc but not pfree, repalloc, or any chunk-header-dependent
> operations).

Hmmm, interesting idea. I considered using a memory context when writing 
the dense allocation, but I was stuck in the "must support all 
operations" mindset, which made it seem impossible. Disallowing some of 
the operations would make it a viable approach, I guess.

> * No arm's-length API would conceivably allow remote callers to
> iterate over all allocated chunks in the way this code does, which is
> why we need to get rid of that behavior.

I'm not convinced we should throw away the idea of walking the chunks. I 
think it's kinda neat and I've been playing with postponing construction 
of the buckets until the very end of the Hash build - it didn't work as 
well as expected, but I'm not ready to throw in the towel yet.

But perhaps the new memory context implementation could support some 
sort of iterator ...

> * There's no way to delete single tuples from the hash table given
> this coding, which no doubt is why you didn't migrate the skew tuples
> into this representation; but it doesn't seem like a very
> future-proof data structure.

I don't recall, but it may be one of the reasons why the skew buckets 
use regular allocation. I don't see how using a new type of memory 
context could solve this, though, as it won't support pfree either. 
Maybe using a separate context for each skew bucket?

> * Not doing anything for the skew tuples doesn't seem very good
> either, considering the whole point of that sub-module is that there
> are likely to be a lot of them.

Maybe I'm missing something, but I thought that the values tracked in 
skew buckets are the MCVs from the outer table, in the hope that we'll 
reduce the amount of data that needs to be spilled to disk when batching 
the outer relation.

I don't see why there should be a lot of them in the inner relation 
(well, I can imagine cases like that, but in my experience those are 
rare cases).

> I note also that while the idea of ExecHashRemoveNextSkewBucket is
> to reduce memory consumed by the skew table to make it available to
> the main hash table, in point of fact it's unlikely that the freed
> space will be of any use at all, since it will be in tuple-sized
> chunks far too small for dense_alloc's requests. So the spaceUsed
> bookkeeping being done there is an academic exercise unconnected to
> reality, and we need to rethink what the space management plan is.

I don't follow. Why would these three things (sizes of allocations in 
skew buckets, chunks in dense allocator and accounting) be related?

FWIW the dense allocator actually made the memory accounting way more 
accurate, as it eliminates most of the overhead that was not included in 
spaceUsed before.

> So I'm of the opinion that a great deal more work is needed here. But
> it's not something we're going to be able to get done for 9.5.1, or
> realistically 9.5.anything. Whereas adding additional klugery to
> ExecHashRemoveNextSkewBucket probably is doable over the weekend.

Agreed.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


