Re: Avoiding hash join batch explosions with extreme skew and weird stats

From Melanie Plageman
Subject Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date
Msg-id CAAKRu_YuQ4gA7XzrH60BbevpjBXZJ_=wS++7DmWtkgG3BBQwfw@mail.gmail.com
In response to Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers


On Sun, May 19, 2019 at 4:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Sat, May 18, 2019 at 12:15 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Thu, May 16, 2019 at 3:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Admittedly I don't have a patch, just a bunch of handwaving.  One
>> reason I haven't attempted to write it is because although I know how
>> to do the non-parallel version using a BufFile full of match bits in
>> sync with the tuples for outer joins, I haven't figured out how to do
>> it for parallel-aware hash join, because then each loop over the outer
>> batch could see different tuples in each participant.  You could use
>> the match bit in HashJoinTuple header, but then you'd have to write
>> all the tuples out again, which is more IO than I want to do.  I'll
>> probably start another thread about that.
>
> Could you explain more about the implementation you are suggesting?
>
> Specifically, what do you mean "BufFile full of match bits in sync with the
> tuples for outer joins?"

First let me restate the PostgreSQL terminology for this stuff so I
don't get confused while talking about it:

* The inner side of the join = the right side = the side we use to
build a hash table.  Right and full joins emit inner tuples when there
is no matching tuple on the outer side.

* The outer side of the join = the left side = the side we use to
probe the hash table.  Left and full joins emit outer tuples when
there is no matching tuple on the inner side.

* Semi and anti joins emit exactly one instance of each outer tuple if
there is/isn't at least one match on the inner side.

We have a couple of relatively easy cases:

* Inner joins: for every outer tuple, we try to find a match in the
hash table, and if we find one we emit a tuple.  To add looping
support, if we run out of memory when loading the hash table we can
just proceed to probe the fragment we've managed to load so far, and
then rewind the outer batch, clear the hash table and load in the next
work_mem-sized fragment and do it again... rinse and repeat until
we've eventually processed the whole inner batch.  After we've
finished looping, we move on to the next batch.
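To make the shape of that loop concrete, here is a toy C sketch of the fragment-at-a-time idea: integers stand in for tuples, arrays for the batch files, and a linear scan for the hash table probe, so none of these names are PostgreSQL's.

```c
#include <stddef.h>

#define WORK_MEM_TUPLES 2   /* pretend only 2 inner tuples fit in memory */

/* Inner join one (inner batch, outer batch) pair, loading the inner
 * side one work_mem-sized fragment at a time and rewinding the outer
 * batch for each fragment.  Returns the number of matches "emitted". */
static int
join_batch_with_looping(const int *inner, size_t ninner,
                        const int *outer, size_t nouter)
{
    int     nmatches = 0;
    size_t  frag;

    for (frag = 0; frag < ninner; frag += WORK_MEM_TUPLES)
    {
        size_t  frag_end = frag + WORK_MEM_TUPLES;

        if (frag_end > ninner)
            frag_end = ninner;

        /* "rewind" the outer batch and probe only the loaded fragment */
        for (size_t o = 0; o < nouter; o++)
            for (size_t i = frag; i < frag_end; i++)
                if (outer[o] == inner[i])
                    nmatches++;     /* inner join: emit immediately */
    }
    return nmatches;
}
```

The point is just that for inner joins no extra state survives between fragments; each pass over the outer batch is independent.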

* For right and full joins ("HJ_FILL_INNER"), we also need to emit an
inner tuple for every tuple that was loaded into the hash table but
never matched.  That's done using a flag HEAP_TUPLE_HAS_MATCH in the
header of the tuples of the hash table, and a scan through the whole
hash table at the end of each batch to look for unmatched tuples
(ExecScanHashTableForUnmatched()).  To add looping support, that just
has to be done at the end of every inner batch fragment, that is,
after every loop.
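Roughly, the flag-plus-final-scan mechanism looks like this toy C sketch, where a struct field stands in for the HEAP_TUPLE_HAS_MATCH header bit and the function names are illustrative, not PostgreSQL's:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct HashEntry
{
    int     key;
    bool    has_match;      /* stand-in for HEAP_TUPLE_HAS_MATCH */
} HashEntry;

/* probe phase: mark every hash table entry the outer side matches */
static void
probe(HashEntry *table, size_t n, const int *outer, size_t nouter)
{
    for (size_t o = 0; o < nouter; o++)
        for (size_t i = 0; i < n; i++)
            if (table[i].key == outer[o])
                table[i].has_match = true;
}

/* end-of-fragment scan: count (i.e. would emit, null-extended) the
 * inner tuples that never matched */
static int
scan_for_unmatched(const HashEntry *table, size_t n)
{
    int     nunmatched = 0;

    for (size_t i = 0; i < n; i++)
        if (!table[i].has_match)
            nunmatched++;
    return nunmatched;
}
```

Because the flag lives with the in-memory hash table entries, running this scan once per fragment is all looping support requires on the inner side.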

And now for the cases that need a new kind of match bit, as far as I can see:

* For left and full joins ("HJ_FILL_OUTER"), we also need to emit an
outer tuple for every tuple that didn't find a match in the hash
table.  Normally that is done while probing, without any need for
memory or match flags: if we don't find a match, we just spit out an
outer tuple immediately.  But that simple strategy won't work if the
hash table holds only part of the inner batch.  Since we'll be
rewinding and looping over the outer batch again for the next inner
batch fragment, we can't yet say whether there will be a match in a
later loop; and a later loop, seeing only its own fragment, can't tell
on its own either.  So we need some kind of cumulative memory between
loops, and we only
outer tuples have a match after we've finished all loops.  So there
would need to be a new function ExecScanOuterBatchForUnmatched().

* For semi joins, we need to emit exactly one outer tuple whenever
there are one or more matches on the inner side.  To add looping support,
we need to make sure that we don't emit an extra copy of the outer
tuple if there is a second match in another inner batch fragment.
Again, this implies some kind of memory between loops, so we can
suppress later matches.

* For anti joins, we need to emit an outer tuple whenever there is no
match.  To add looping support, we need to wait until we've seen all
the inner batch fragments before we know that a given outer tuple has
no match, perhaps with the same new function
ExecScanOuterBatchForUnmatched().

So, we need some kind of inter-loop memory, but we obviously don't
want to create another source of unmetered RAM gobbling.  So one idea
is a BufFile that has one bit per outer tuple in the batch.  In the
first loop, we just stream out the match results as we go, and then
somehow we OR the bitmap with the match results in subsequent loops.
After the last loop, we have a list of unmatched tuples -- just scan
it in lock-step with the outer batch and look for 0 bits.
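In toy C, with an in-memory byte array standing in for the BufFile, the OR-and-scan steps might look like this (the names are made up for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* fold this loop's match results into the cumulative bitmap */
static void
bitmap_or(uint8_t *accum, const uint8_t *loop, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        accum[i] |= loop[i];
}

/* after the last loop: count (i.e. would emit) the outer tuples whose
 * bit is still 0, scanning in lock-step with the outer batch */
static int
count_unmatched(const uint8_t *accum, size_t ntuples)
{
    int     n = 0;

    for (size_t t = 0; t < ntuples; t++)
        if ((accum[t / 8] & (1 << (t % 8))) == 0)
            n++;
    return n;
}
```

One byte of bitmap covers eight outer tuples, so even a huge batch costs little I/O per loop compared with rewriting the tuples themselves.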

Unfortunately that bits-in-order scheme doesn't work for parallel
hash, where the SharedTuplestore tuples seen by each worker are
non-deterministic.  So perhaps in that case we could use the
HEAP_TUPLE_HAS_MATCH bit in the outer tuple header itself, and write
the whole outer batch back out each time through the loop.  That'd
keep the tuples and match bits together, but it seems like a lot of
IO...  Note that parallel hash doesn't support right/full joins today,
because of some complications about waiting and deadlocks that might
turn out to be relevant here too, and might be solvable (I should
probably write about that in another email), but left joins *are*
supported today, so they would need to be desupported if we wanted to
add a loop-based escape valve without also dealing with these
problems.  That doesn't seem acceptable, which is why I'm a bit stuck
on this point,
and unfortunately it may be a while before I have time to tackle any
of that personally.


There was an off-list discussion at PGCon last week about doing this
hash looping strategy using the bitmap with match bits, and about
solving the parallel hashjoin problem by encoding tuple-identifying
information in the bitmap.  That would let each worker record that an
outer tuple had a match while processing a given inner-side chunk;
then, at the end of the scan of the outer side, the bitmaps would be
OR'd together to produce a single view of the unmatched tuples from
that iteration.

I was talking to Jeff Davis about this on Saturday, and he felt that
there might be a way to solve the problem differently if we thought of
the left join case as performing an inner join and an antijoin
instead.

Riffing on this idea a bit, I started trying to write a patch that
would basically emit a tuple if it matches and write the tuple out to
a file if it does not match.  Then, after iterating through the outer
batch the first time for the first inner chunk, the only tuples that
still need to be joined against the other inner chunks are the ones
that do not yet have a match.  So, instead of iterating through the
original outer-side batch file, use the file of unmatched outer tuples
to do the join against the next chunk.  Repeat this for all chunks.
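Here is a toy C sketch of what I mean, with in-place array compaction standing in for the batch files and all names made up:

```c
#include <stddef.h>

/* One pass over the current unmatched set against one inner chunk:
 * emit matches now, keep only the leftovers ("write them to the
 * unmatched file") for the next chunk.  Returns the new size of the
 * unmatched set. */
static size_t
probe_chunk(int *unmatched, size_t nunmatched,
            const int *chunk, size_t nchunk, int *nemitted)
{
    size_t  kept = 0;

    for (size_t o = 0; o < nunmatched; o++)
    {
        int     found = 0;

        for (size_t i = 0; i < nchunk; i++)
            if (unmatched[o] == chunk[i])
            {
                found = 1;
                (*nemitted)++;      /* join match: emit now */
                break;
            }
        if (!found)
            unmatched[kept++] = unmatched[o];   /* carry to next chunk */
    }
    return kept;
}
```

Whatever survives the last chunk is exactly the antijoin output, so a left join falls out as "emit matches as you go, then emit the final unmatched file null-extended".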

Could we not do this and avoid using the match bit?  In the worst
case, you would have to write out all the tuples on the outer side (if
none match) nchunks times (a chunk being the work_mem-sized portion of
the inner side loaded into the hashtable).

--
Melanie Plageman
