Re: Avoiding hash join batch explosions with extreme skew and weird stats

From: Melanie Plageman
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date:
Msg-id: CAAKRu_Zj9j1RBe5osRmLLNK87-QzR+8O8K3d9nQNAmUL7y8WSA@mail.gmail.com
In reply to: Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers

On Sun, May 19, 2019 at 4:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, May 18, 2019 at 12:15 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > On Thu, May 16, 2019 at 3:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> >> Admittedly I don't have a patch, just a bunch of handwaving.  One
> >> reason I haven't attempted to write it is because although I know how
> >> to do the non-parallel version using a BufFile full of match bits in
> >> sync with the tuples for outer joins, I haven't figured out how to do
> >> it for parallel-aware hash join, because then each loop over the outer
> >> batch could see different tuples in each participant.  You could use
> >> the match bit in HashJoinTuple header, but then you'd have to write
> >> all the tuples out again, which is more IO than I want to do.  I'll
> >> probably start another thread about that.
> >
> > Could you explain more about the implementation you are suggesting?
> >
> > Specifically, what do you mean "BufFile full of match bits in sync with the
> > tuples for outer joins?"
>
> First let me restate the PostgreSQL terminology for this stuff so I
> don't get confused while talking about it:
>
> * The inner side of the join = the right side = the side we use to
> build a hash table.  Right and full joins emit inner tuples when there
> is no matching tuple on the outer side.
>
> * The outer side of the join = the left side = the side we use to
> probe the hash table.  Left and full joins emit outer tuples when
> there is no matching tuple on the inner side.
>
> * Semi and anti joins emit exactly one instance of each outer tuple if
> there is/isn't at least one match on the inner side.
>
> We have a couple of relatively easy cases:
>
> * Inner joins: for every outer tuple, we try to find a match in the
> hash table, and if we find one we emit a tuple.  To add looping
> support, if we run out of memory when loading the hash table we can
> just proceed to probe the fragment we've managed to load so far, and
> then rewind the outer batch, clear the hash table and load in the next
> work_mem-sized fragment and do it again... rinse and repeat until
> we've eventually processed the whole inner batch.  After we've
> finished looping, we move on to the next batch.
>
> * For right and full joins ("HJ_FILL_INNER"), we also need to emit an
> inner tuple for every tuple that was loaded into the hash table but
> never matched.  That's done using a flag HEAP_TUPLE_HAS_MATCH in the
> header of the tuples of the hash table, and a scan through the whole
> hash table at the end of each batch to look for unmatched tuples
> (ExecScanHashTableForUnmatched()).  To add looping support, that just
> has to be done at the end of every inner batch fragment, that is,
> after every loop.
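
Just to check my understanding of the control flow in these two easy
cases, it would look something like the below? (A standalone toy, with
arrays of ints standing in for batch files and FRAG_SIZE standing in
for however many tuples fit in work_mem -- not real executor code, and
the names are made up.)

#include <stdio.h>
#include <stdbool.h>

#define FRAG_SIZE 3             /* pretend 3 inner tuples fill work_mem */

int
main(void)
{
    int     outer[] = {1, 2, 2, 5};
    int     outer_n = 4;
    int     inner[] = {2, 3, 5, 5, 7, 8};
    int     inner_n = 6;

    /* One loop over the outer batch per work_mem-sized inner fragment. */
    for (int start = 0; start < inner_n; start += FRAG_SIZE)
    {
        int     end = (start + FRAG_SIZE < inner_n) ? start + FRAG_SIZE : inner_n;
        bool    has_match[FRAG_SIZE] = {false}; /* HEAP_TUPLE_HAS_MATCH stand-in */

        /* "Rewind" the outer batch and probe the loaded fragment. */
        for (int i = 0; i < outer_n; i++)
        {
            for (int j = start; j < end; j++)
            {
                if (outer[i] == inner[j])
                {
                    printf("inner join: emit (%d, %d)\n", outer[i], inner[j]);
                    has_match[j - start] = true;
                }
            }
        }

        /* Right/full join: scan this fragment for unmatched inner tuples. */
        for (int j = start; j < end; j++)
        {
            if (!has_match[j - start])
                printf("right join: emit (NULL, %d)\n", inner[j]);
        }
    }
    return 0;
}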

> And now for the cases that need a new kind of match bit, as far as I can see:
>
> * For left and full joins ("HJ_FILL_OUTER"), we also need to emit an
> outer tuple for every tuple that didn't find a match in the hash
> table.  Normally that is done while probing, without any need for
> memory or match flags: if we don't find a match, we just spit out an
> outer tuple immediately.  But that simple strategy won't work if the
> hash table holds only part of the inner batch.  Since we'll be
> rewinding and looping over the outer batch again for the next inner
> batch fragment, we can't yet say if there will be a match in a later
> loop.  But the later loops don't know on their own either.  So we need
> some kind of cumulative memory between loops, and we only know which
> outer tuples have a match after we've finished all loops.  So there
> would need to be a new function ExecScanOuterBatchForUnmatched().
>
> * For semi joins, we need to emit exactly one outer tuple whenever
> there are one or more matches on the inner side.  To add looping
> support, we need to make sure that we don't emit an extra copy of the
> outer tuple if there is a second match in another inner batch fragment.
> Again, this implies some kind of memory between loops, so we can
> suppress later matches.
>
> * For anti joins, we need to emit an outer tuple whenever there is no
> match.  To add looping support, we need to wait until we've seen all
> the inner batch fragments before we know that a given outer tuple has
> no match, perhaps with the same new function
> ExecScanOuterBatchForUnmatched().
>
> So, we need some kind of inter-loop memory, but we obviously don't
> want to create another source of unmetered RAM gobbling.  So one idea
> is a BufFile that has one bit per outer tuple in the batch.  In the
> first loop, we just stream out the match results as we go, and then
> somehow we OR the bitmap with the match results in subsequent loops.
> After the last loop, we have a list of unmatched tuples -- just scan
> it in lock-step with the outer batch and look for 0 bits.

That makes sense. Thanks for the detailed explanation.
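
Concretely, for the serial case I'm picturing something like the toy
below, where the matched[] array stands in for the BufFile of match
bits (again, invented names, not executor code). For semi joins the
same bits could suppress duplicate emissions, and for anti joins only
the final scan would emit anything.

#include <stdio.h>
#include <string.h>

#define OUTER_N   8
#define FRAG_SIZE 3             /* pretend 3 inner tuples fill work_mem */

int
main(void)
{
    int     outer[OUTER_N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int     inner[] = {2, 4, 4, 6, 8, 10, 12};
    int     inner_n = 7;
    unsigned char matched[(OUTER_N + 7) / 8];   /* match-bit BufFile stand-in */

    memset(matched, 0, sizeof(matched));

    for (int start = 0; start < inner_n; start += FRAG_SIZE)
    {
        int     end = (start + FRAG_SIZE < inner_n) ? start + FRAG_SIZE : inner_n;

        /* Rewind the outer batch and probe the loaded fragment. */
        for (int i = 0; i < OUTER_N; i++)
        {
            for (int j = start; j < end; j++)
            {
                if (outer[i] == inner[j])
                {
                    printf("emit (%d, %d)\n", outer[i], inner[j]);
                    matched[i / 8] |= 1 << (i % 8);     /* OR in the match bit */
                }
            }
        }
    }

    /* The ExecScanOuterBatchForUnmatched() step: look for 0 bits. */
    for (int i = 0; i < OUTER_N; i++)
    {
        if (!(matched[i / 8] & (1 << (i % 8))))
            printf("left join: emit (%d, NULL)\n", outer[i]);
    }
    return 0;
}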
 

> Unfortunately that bits-in-order scheme doesn't work for parallel
> hash, where the SharedTuplestore tuples seen by each worker are
> non-deterministic.  So perhaps in that case we could use the
> HEAP_TUPLE_HAS_MATCH bit in the outer tuple header itself, and write
> the whole outer batch back out each time through the loop.  That'd
> keep the tuples and match bits together, but it seems like a lot of
> IO...

If you set the has_match flag in the tuple header itself, wouldn't you
only need to write back out the tuples from the outer batch that don't
have matches?
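
At least for semi and anti joins, where a single match is enough, I'm
imagining each loop could rewrite a shrinking outer spill file, along
these lines (toy again; the arrays stand in for the spill files). For
left and full joins I suppose you'd still need every outer tuple in
every loop, since a tuple that matched in one fragment can match again
in a later one.

#include <stdio.h>
#include <string.h>

#define FRAG_SIZE 3             /* pretend 3 inner tuples fill work_mem */

int
main(void)
{
    int     cur[] = {1, 2, 3, 4, 5, 6, 7, 8};  /* current outer spill "file" */
    int     cur_n = 8;
    int     next[8];                            /* rewritten spill "file" */
    int     inner[] = {2, 4, 6, 8, 10, 12, 3};
    int     inner_n = 7;

    for (int start = 0; start < inner_n; start += FRAG_SIZE)
    {
        int     end = (start + FRAG_SIZE < inner_n) ? start + FRAG_SIZE : inner_n;
        int     next_n = 0;

        for (int i = 0; i < cur_n; i++)
        {
            int     match = 0;

            for (int j = start; j < end && !match; j++)
                match = (cur[i] == inner[j]);

            if (match)
                printf("semi join: emit %d\n", cur[i]);  /* once, then drop it */
            else
                next[next_n++] = cur[i];    /* still unmatched, carry forward */
        }

        memcpy(cur, next, next_n * sizeof(int));
        cur_n = next_n;                     /* the outer "file" shrinks */
    }

    /* Anti join: whatever survived every loop had no match anywhere. */
    for (int i = 0; i < cur_n; i++)
        printf("anti join: emit %d\n", cur[i]);

    return 0;
}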
 
> > If so, why do you need to keep track of the outer tuples seen?
> > If you are going to loop through the whole outer side for each tuple on the
> > inner side, it seems like you wouldn't need to.

> The idea is to loop through the whole outer batch for every
> work_mem-sized inner batch fragment, not every tuple.  Though in
> theory it could be as small as a single tuple.

> > Could you make an outer "batch" which is the whole of the outer relation? That
> > is, could you do something like: when hashing the inner side, if re-partitioning
> > is resulting in batches that will overflow spaceAllowed, could you set a flag
> > use_NLJ on that batch, and, when making batches for the outer side, make one
> > "batch" that has all the tuples from the outer side which the flagged inner
> > side batch will do NLJ with?

> I didn't understand this... you always need to make one outer batch
> corresponding to every inner batch.  The problem is the tricky
> left/full/anti/semi join cases when joining against fragments holding
> less than the full inner batch: we still need some way to implement
> join logic that depends on knowing whether there is a match in *any*
> of the inner fragments/loops.

Sorry, my suggestion was inaccurate and unclear. I was basically
suggesting that, once all of the outer and inner batches have been
created, then for a given inner batch that does not fit in memory, you
would load and join all of the chunks of that inner batch file for each
outer tuple in the corresponding outer batch file. That way, before you
emit an outer tuple, you have checked all of the corresponding inner
batch.

Thinking about it now, I realize that that would be worse in all cases
than what you are thinking of -- joining the outer side batch with the
inner side batch chunk that fits in memory, marking the BufFile bit
representing that outer side tuple as "matched", and only emitting it
with a NULL from the inner side after all chunks have been processed.
My version would re-read the whole inner batch once per outer tuple,
whereas yours re-reads the outer batch only once per work_mem-sized
inner chunk.

--
Melanie Plageman
