Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Дата
Msg-id 22901.1227228246@sss.pgh.pa.us
обсуждение исходный текст
Ответ на Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets  ("Lawrence, Ramon" <ramon.lawrence@ubc.ca>)
Ответы Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets  ("Lawrence, Ramon" <ramon.lawrence@ubc.ca>)
Список pgsql-hackers
"Lawrence, Ramon" <ramon.lawrence@ubc.ca> writes:
> We propose a patch that improves hybrid hash join's performance for
> large multi-batch joins where the probe relation has skew.
> ...
> The basic idea
> is to keep build relation tuples in a small in-memory hash table that
> have join values that are frequently occurring in the probe relation.

I looked at this patch a little.

I'm a tad worried about what happens when the values that are frequently
occurring in the outer relation are also frequently occurring in the
inner (which hardly seems an improbable case).  Don't you stand a severe
risk of blowing out the in-memory hash table?  It doesn't appear to me
that the code has any way to back off once it's decided that a certain
set of join key values are to be treated in-memory.  Splitting the main
join into more batches certainly doesn't help with that.

Also, AFAICS the benefit of this patch comes entirely from avoiding dump
and reload of tuples bearing the most common values, which means it's a
significant waste of cycles when there's only one batch.  It'd be better
to avoid doing any of the extra work in the single-batch case.

One thought that might address that point as well as the difficulty of
getting stats in nontrivial cases is to wait until we've overrun memory
and are forced to start batching, and at that point determine on-the-fly
which are the most common hash values from inspection of the hash table
as we dump it out.  This would amount to optimizing on the basis of
frequency in the *inner* relation not the outer, but offhand I don't see
any strong theoretical basis why that wouldn't be just as good.  It
could lose if the first work_mem worth of inner tuples isn't
representative of what follows; but this hardly seems more dangerous
than depending on MCV stats that are for the whole outer relation rather
than the portion of it being selected.
        regards, tom lane


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Bruce Momjian
Дата:
Сообщение: Re: Proposal: new border setting in psql
Следующее
От: Ron Mayer
Дата:
Сообщение: Re: Re: Updated interval patches - ECPG [was, intervalstyle....]