Re: Explanation for bug #13908: hash joins are badly broken

From: Tom Lane
Subject: Re: Explanation for bug #13908: hash joins are badly broken
Date:
Msg-id: 15227.1454780876@sss.pgh.pa.us
In response to: Re: Explanation for bug #13908: hash joins are badly broken  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: Explanation for bug #13908: hash joins are badly broken  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Re: Explanation for bug #13908: hash joins are badly broken  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On 02/06/2016 02:34 AM, Tom Lane wrote:
>> I have sussed what's happening in bug #13908. Basically, commit
>> 45f6240a8fa9d355 ("Pack tuples in a hash join batch densely, to save
>> memory") broke things for the case where a hash join is using a skew
>> table.

> Damn, that's an embarrassing oversight :-/

> I believe the attached patch should fix this by actually copying the 
> tuples into the densely allocated chunks. Haven't tested it though, will 
> do in a few hours.

Yeah, that's one fix approach I was contemplating last night.  (I think
the patch as written leaks memory and doesn't account for space usage
properly either, but certainly this is a direction we could take.)

The other answer I was thinking about was to get rid of the assumption
that iterating over the chunk storage is a valid thing to do, and
instead scan the hashbucket chains when we need to visit all tuples.
I really do not like the patch as designed, for several reasons:

* It incorporates a bespoke reimplementation of palloc into hash joins.
This is not a maintainable/scalable way to go about reducing memory
consumption.  It should have been done with an arm's-length API to a
new type of memory context, IMO (probably one that supports palloc but
not pfree, repalloc, or any chunk-header-dependent operations).

* No arm's-length API would conceivably allow remote callers to iterate
over all allocated chunks in the way this code does, which is why we need
to get rid of that behavior.

* There's no way to delete single tuples from the hash table given this
coding, which no doubt is why you didn't migrate the skew tuples into
this representation; but it doesn't seem like a very future-proof data
structure.

* Not doing anything for the skew tuples doesn't seem very good either,
considering the whole point of that sub-module is that there are likely
to be a lot of them.  I note also that while the idea of
ExecHashRemoveNextSkewBucket is to reduce memory consumed by the skew
table to make it available to the main hash table, in point of fact it's
unlikely that the freed space will be of any use at all, since it will be
in tuple-sized chunks far too small for dense_alloc's requests.  So the
spaceUsed bookkeeping being done there is an academic exercise unconnected
to reality, and we need to rethink what the space management plan is.

So I'm of the opinion that a great deal more work is needed here.
But it's not something we're going to be able to get done for 9.5.1,
or realistically 9.5.anything.  Whereas adding additional klugery to
ExecHashRemoveNextSkewBucket probably is doable over the weekend.
        regards, tom lane


