Re: Avoiding hash join batch explosions with extreme skew and weird stats

From: Melanie Plageman
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date:
Msg-id: CAAKRu_bBp7qnpCSpqn9_Ukm-9gE_mdT+S_qz=PhXZkmCq2RMqw@mail.gmail.com
In reply to: Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers


On Fri, Jun 7, 2019 at 7:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jun 6, 2019 at 7:31 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I'm not sure I understand why you would need to compare the original
> tuples to the unmatched tuples file.

I think I was confused.  Actually, I'm still not sure I understand this part:

> Then, you iterate again through the outer side a third time to join it
> to the unmatched tuples in the unmatched tuples file (from the first
> chunk) and write the following to a new unmatched tuples file:
> 5
> 9
> 11
> 11

and likewise here

> Then you iterate a fifth time through the outer side to join it to the
> unmatched tuples in the unmatched tuples file (from the second chunk)
> and write the following to a new unmatched tuples file:
> 11
> 11

So you refer to joining the outer side to the unmatched tuples file,
but how would that tell you which outer tuples had no matches on the
inner side?  I think what you'd need to do is anti-join the unmatched
tuples file to the current inner batch.  So the algorithm would be
something like:

for each inner batch:
  for each outer tuple:
    if tuple matches inner batch then emit match
    if tuple does not match inner batch and this is the first inner batch:
      write tuple to unmatched tuples file
  if this is not the first inner batch:
    for each tuple from the unmatched tuples file:
      if tuple does not match inner batch:
        write to new unmatched tuples file
    discard previous unmatched tuples file and use the new one for the next iteration

for each tuple in the final unmatched tuples file:
  null-extend and emit

If that's not what you have in mind, maybe you could provide some
similar pseudocode?  Or you can just ignore me. I'm not trying to
interfere with an otherwise-fruitful discussion by being the only one
in the room who is confused...


Yep, the pseudo-code you have above is exactly what I was thinking. I
have been hacking around on my fork implementing this for the
non-parallel hashjoin (my idea was to implement a parallel-friendly
design but for the non-parallel-aware case and then go back and
implement it for the parallel-aware hashjoin later) and have some
thoughts.

I'll call the whole adaptive hashjoin fallback strategy "chunked
hashloop join" for the purposes of this description.
I'll abbreviate the three approaches we've discussed like this:

Approach A is using a separate data structure (a bitmap was the
suggested pick) to track the match status of each outer tuple

Approach B is the inner-join + anti-join writing out unmatched tuples
to a new file for every iteration through the outer side batch (for
each chunk of inner)

Approach C is setting a match bit in the tuple and then writing all
outer side tuples out for every iteration through the outer side (for
each chunk of inner)

To get started, I implemented the inner-side chunking logic, which is
required for all of the approaches. I did a super basic version which
only allows nbatches to be increased during the initial hashtable
build--not during loading of subsequent batches. If a batch after
batch 0 runs out of work_mem, it just loads what will fit and saves
the inner page offset in the hashjoin state.
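
To make the resume-point idea concrete, here is a rough standalone
sketch, with plain stdio standing in for the batch's spill file and a
fixed tuple size and per-chunk memory budget invented for illustration
(real code would use BufFile and the executor's tuple and memory
accounting):

#include <stdio.h>
#include <stdbool.h>

#define TUPLE_SIZE 64            /* pretend every spilled tuple is 64 bytes */
#define CHUNK_MEM  (4 * 1024)    /* pretend work_mem share for one chunk */

/* Load at most one chunk of the inner batch, starting at *resume_off.
 * Returns true if the batch is exhausted, false if another chunk
 * remains; *resume_off is updated to point at the first unloaded tuple. */
bool
load_inner_chunk(FILE *inner_spill, long *resume_off)
{
    char tuple[TUPLE_SIZE];
    long used = 0;

    fseek(inner_spill, *resume_off, SEEK_SET);
    while (used + TUPLE_SIZE <= CHUNK_MEM)
    {
        if (fread(tuple, TUPLE_SIZE, 1, inner_spill) != 1)
            return true;              /* no more tuples: last chunk loaded */
        /* insert_into_hashtable(tuple) would go here */
        used += TUPLE_SIZE;
    }
    *resume_off = ftell(inner_spill); /* chunk full: remember where to resume */
    return false;
}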

Part of the allure of approaches B and C for me was that they seemed
to require less code complexity and concurrency control, because you
could just write out the unmatched tuples (probably to a
SharedTupleStore) without having to care about their original order
or page offset. Neither approach seemed to require treating a spill
file as if it permitted random access, nor treating the tuples in a
SharedTupleStore as ordered.

The benefit I saw of approach B over approach C was that, when many
of the outer tuples are matches, it requires fewer writes than
approach C--at the cost of additional reads. In the worst case it
requires the same number of writes as approach C.
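
To make the trade-off concrete with invented numbers: suppose an
outer batch has 1,000 tuples, the inner batch is split into 4 chunks,
and 900 of the outer tuples find their matches on the first chunk.
Approach C rewrites all 1,000 outer tuples on every pass, so on the
order of 3,000 to 4,000 tuple writes in total, while approach B
writes only the tuples still unmatched after each chunk, at most a
few hundred writes in total. The cost is that approach B re-reads the
shrinking unmatched tuples file on each later pass in addition to
scanning the outer batch itself.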

Approach B turned out to be problematic for many reasons. First of
all, with approach B, you end up having to keep track of an
additional new spill file of unmatched outer tuples for every chunk
of the inner side. Each spill file could have a different number of
tuples, so any reuse of the files seems difficult to get right. For
approach C (which I did not try to implement), it seems like you
could get away with maintaining only two spill files for the outer
side--one to read from and one to write to. I'm sure it is more
complicated than this. However, it seemed like, for approach B, you
would need to create and destroy entirely new unmatched tuple spill
files for every chunk.
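
For what it's worth, here is how I picture the two-file idea for
approach C -- just a sketch with invented names, where
probe_current_inner_chunk() stands in for probing the in-memory hash
table built from the current inner chunk:

#include <stdio.h>
#include <stdbool.h>

typedef struct SpilledOuterTuple
{
    bool matched;                 /* carried along with the tuple on disk */
    char payload[60];
} SpilledOuterTuple;

bool probe_current_inner_chunk(const SpilledOuterTuple *tup);

/* One pass over the outer batch for one inner chunk: read every outer
 * tuple from one spill file, probe it, and rewrite it (with a possibly
 * updated match bit) to the other spill file, then swap the two files'
 * roles for the next chunk. */
void
outer_pass_for_chunk(FILE **read_file, FILE **write_file)
{
    SpilledOuterTuple tup;
    FILE *tmp;

    rewind(*read_file);
    while (fread(&tup, sizeof(tup), 1, *read_file) == 1)
    {
        if (probe_current_inner_chunk(&tup))
            tup.matched = true;   /* matches would also be emitted here */
        fwrite(&tup, sizeof(tup), 1, *write_file);
    }

    /* Next pass reads what this pass wrote; real code would also rewind
     * and truncate the file that becomes the new write target. */
    tmp = *read_file;
    *read_file = *write_file;
    *write_file = tmp;
}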

Approach B was not simpler when it came to the code complexity of the
state machine either -- you have to do something different for the
first chunk than for the other chunks (the first chunk writes to the
unmatched tups file but reads from the original spill file, whereas
the other chunks write to a new unmatched tups file and read from the
previous one), which adds complexity to the state machine (and, I
imagine, to worker orchestration in the parallel implementation). And
you still have to process all of the unmatched tups, null-extend
them, and emit them before advancing the batch.

So, I decided to try out approach A. The crux of the idea (my
understanding of it, at least) is to keep a separate data structure
which has the match status of each outer tuple in the batch. The
discussion was to do this with a bitmap in a file, but I started by
doing it with a list in memory.

What I have so far is a list of structs--one for each outer
tuple--where each struct has a match flag and the page offset of that
tuple in the outer spill file. I add a struct to the list each time I
get a tuple from the spill file in the HJ_NEED_NEW_OUTER state to
join it to the first chunk of the inner, and, since that is the only
point at which I am reading an outer tuple from the spill file, I
also grab the page offset there and set it in the struct.

While building the list, and while processing each subsequent chunk
of the inner, if a tuple is a match, I set the match flag to true in
that outer tuple's entry in the list.

Then, after finishing the whole inner batch, I loop through the list,
and, for each unmatched tuple, I go to that offset in the spill file
and get that tuple and NULL-extend and emit it.
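
A stripped-down standalone model of what I have, with stdio standing
in for the outer batch's spill file and the struct and function names
invented for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

typedef struct OuterTupleStatus
{
    bool matched;            /* set if any inner chunk produced a match */
    long spill_offset;       /* where the tuple starts in the outer spill file */
} OuterTupleStatus;

/* Called once per outer tuple while probing the first inner chunk, just
 * before the tuple is read from the spill file, so ftell() still points
 * at the start of the tuple. */
OuterTupleStatus *
record_outer_tuple(OuterTupleStatus *list, size_t *ntuples, FILE *outer_spill)
{
    list = realloc(list, (*ntuples + 1) * sizeof(OuterTupleStatus));
    list[*ntuples].matched = false;
    list[*ntuples].spill_offset = ftell(outer_spill);
    (*ntuples)++;
    return list;
}

/* After the last inner chunk: revisit each unmatched tuple by offset,
 * then NULL-extend and emit it. */
void
emit_unmatched(OuterTupleStatus *list, size_t ntuples, FILE *outer_spill)
{
    for (size_t i = 0; i < ntuples; i++)
    {
        if (list[i].matched)
            continue;
        fseek(outer_spill, list[i].spill_offset, SEEK_SET);
        /* read the tuple at this offset, NULL-extend it, and emit it */
    }
}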

(Currently, I have a problem with the list and it doesn't produce
correct results yet.)

Thinking about how to move from my list of offsets to using a bitmap,
I got confused.

Let me try to articulate what I think the bitmap implementation would look
like:

Before doing chunked hashloop join for any batch, we would need to
know how many tuples are in the outer batch to make the bitmap the
correct size.

We could do this either with one loop through the whole outer batch
file right before joining it to the inner batch (an extra loop).

Or we could try to do it during the first read of the outer relation
when processing batch 0 and keep a data structure mapping each batch
number to the number of outer tuples spilled to that batch.

Then, once we have this number, before joining the outer to the first
chunk of the inner, we would generate a bitmap with one bit per tuple
in the outer batch and save it somewhere (eventually in a file,
initially in the hjstate).
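
Something like this is what I have in mind for the bitmap itself (a
minimal sketch; it assumes bit i means "the i-th tuple read from the
outer batch's spill file"):

#include <stdlib.h>
#include <stdbool.h>

typedef struct OuterMatchBitmap
{
    size_t         ntuples;   /* number of outer tuples in this batch */
    unsigned char *bits;      /* one bit per outer tuple, all initially 0 */
} OuterMatchBitmap;

OuterMatchBitmap
make_match_bitmap(size_t ntuples)
{
    OuterMatchBitmap bm;

    bm.ntuples = ntuples;
    bm.bits = calloc((ntuples + 7) / 8, 1);
    return bm;
}

void
set_match(OuterMatchBitmap *bm, size_t tupleno)
{
    bm->bits[tupleno / 8] |= (unsigned char) (1 << (tupleno % 8));
}

bool
is_match(const OuterMatchBitmap *bm, size_t tupleno)
{
    return (bm->bits[tupleno / 8] & (1 << (tupleno % 8))) != 0;
}

But note that this sketch only works if every pass reads the outer
spill file in the same order, so that "the i-th tuple read" is a
stable identifier -- which is really the same numbering question I
ask below.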

Now, I am back to the original problem--how do you know which bit to
set without somehow numbering the tuples with a unique identifier? Is
there anything that uniquely identifies a spill file tuple except its
offset?

--
Melanie Plageman
