Re: Avoiding hash join batch explosions with extreme skew and weird stats

From: Melanie Plageman
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date:
Msg-id: CAAKRu_YWp4G1ma8TLpAUUx9QvF-_xqyviuKOGXpt2MPaJENmuw@mail.gmail.com
In reply to: Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Melanie Plageman <melanieplageman@gmail.com>)
List: pgsql-hackers


On Thu, Jun 13, 2019 at 7:10 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jun 11, 2019 at 2:35 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Let me try to articulate what I think the bitmap implementation would look
> like:
>
> Before doing chunked hashloop join for any batch, we would need to
> know how many tuples are in the outer batch to make the bitmap the
> correct size.

I was thinking that we wouldn't need to know this, because if the
bitmap is in a file, we can always extend it.  To imagine a needlessly
dumb implementation, consider:

set-bit(i):
    let b = i / 8
    while (b >= length of file in bytes)
        append '\0' to file
    read byte b from the file
    modify the byte you read by setting bit i % 8
    write the modified byte back to the file

In reality, we'd have some kind of buffer.  I imagine locality of
reference would be pretty good, because the outer tuples are coming to
us in increasing-tuple-number order.
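
For concreteness, here is a standalone C sketch of that "needlessly dumb"
approach, with plain stdio standing in for PostgreSQL's temp-file machinery
(all names here are illustrative, not from any patch; the file is assumed
to be opened with "w+b"):

    #include <stdio.h>

    /* Set bit i in a bitmap stored in file f, zero-extending the file
     * on demand as in the pseudocode above. */
    static void
    set_bit(FILE *f, long i)
    {
        long    b = i / 8;
        long    len;
        int     byte;

        /* Append zero bytes until byte b exists. */
        fseek(f, 0, SEEK_END);
        len = ftell(f);
        while (b >= len)
        {
            fputc('\0', f);
            len++;
        }

        /* Read byte b, set bit i % 8, and write it back. */
        fseek(f, b, SEEK_SET);
        byte = fgetc(f);
        byte |= 1 << (i % 8);
        fseek(f, b, SEEK_SET);
        fputc(byte, f);
    }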

If you want to prototype with an in-memory implementation, I'd suggest
just pallocing 8kB initially and repallocing when the tuple number
gets too big.  It'll be sorta inefficient, but who cares?  It's
certainly way cheaper than an extra pass over the data, and for a POC
it should be fine.
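
A minimal sketch of that in-memory variant, assuming postgres.h is included
so that palloc0/repalloc and Size/uint64 are available (the struct and
function names are hypothetical):

    typedef struct OuterMatchBitmap
    {
        unsigned char *bits;
        Size        nbytes;         /* currently allocated bytes */
    } OuterMatchBitmap;

    /* Mark outer tuple number tupleno as matched, doubling the
     * allocation (from an initial 8kB) whenever tupleno outgrows it. */
    static void
    bitmap_set_match(OuterMatchBitmap *bm, uint64 tupleno)
    {
        Size        b = tupleno / 8;

        if (bm->bits == NULL)
        {
            bm->nbytes = 8192;
            bm->bits = palloc0(bm->nbytes);
        }
        while (b >= bm->nbytes)
        {
            Size        newsize = bm->nbytes * 2;

            bm->bits = repalloc(bm->bits, newsize);
            memset(bm->bits + bm->nbytes, 0, newsize - bm->nbytes);
            bm->nbytes = newsize;
        }
        bm->bits[b] |= (unsigned char) (1 << (tupleno % 8));
    }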


That approach makes sense. I have attached the first draft of a patch
I wrote to do parallel-oblivious hashjoin fallback. I haven't switched
to using the approach with a bitmap (or bytemap :) yet because I found
that using a linked list was easier to debug for now.

(Also, I did things like include the value of the outer tuple
attribute in the linked list nodes and assumed it was an int because
that is what I have been testing with--this would definitely be blown
away with everything else that is just there to help me with debugging
right now).
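
(To illustrate, this is roughly the shape of such a node -- my guess at it,
not the patch's actual struct:

    /* Hypothetical per-outer-tuple match-status node; the int attribute
     * is the debugging aid mentioned above and would go away. */
    typedef struct OuterMatchStatusNode
    {
        int         outer_attr;     /* debugging only; assumes an int key */
        bool        matched;
        struct OuterMatchStatusNode *next;
    } OuterMatchStatusNode;
)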

I am refactoring it now to change the state machine to make more sense
before changing the representation of the match statuses.

So, specifically, I am interested in high-level gut checks on the
state machine I am currently implementing (not reflected in this
patch).

This patch adds only one state -- HJ_ADAPTIVE_EMIT_UNMATCHED -- which
duplicates the logic of HJ_FILL_OUTER_TUPLE. Also, in this patch, the
existing HJ_NEED_NEW_BATCH state is used for new chunks. After
separating the logic that advances the batches from the logic that
loads a batch, it felt like NEED_NEW_CHUNK did not need to be its own
state. When a new chunk is required, if more exist, then the next one
should be loaded and the outer side should be rewound. Rewinding the
outer side was already being done (seeking to the beginning of the
outer spill file is the equivalent of "loading" it); see the sketch
below.
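
For reference, the rewind itself is just a seek on the outer batch
file, along these lines (modeled on what ExecHashJoinNewBatch() in
nodeHashjoin.c already does; hashtable and curbatch come from the
executor state):

    if (BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET))
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not rewind hash-join temporary file: %m")));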

Currently, I am tracking a lot of state in the HashJoinState, which is
fiddly and error-prone.

New state machine (questions posed below):
To refactor the state machine, I am thinking of adding a new state,
HJ_NEED_NEW_INNER_CHUNK, which we would transition to when the outer
batch is exhausted. We would load the new chunk, rewind the outer
side, and transition to HJ_NEED_NEW_OUTER. However, we would have to
emit unmatched inner tuples for that chunk (in the ROJ case) before
that transition to HJ_NEED_NEW_OUTER. This feels a little less clean
because the HJ_FILL_INNER_TUPLES state is also transitioned into when
the inner batch is exhausted. And, in the current flow I am sketching
out, when the inner batch is exhausted, we check whether we should
emit NULL-extended inner tuples and then whether we should emit
NULL-extended outer tuples (since both batches are exhausted), whereas
when a single inner chunk is done being processed, we only want to
emit NULL-extended tuples for the inner side. Not to mention that
HJ_NEED_NEW_INNER_CHUNK would transition to HJ_NEED_NEW_OUTER directly
instead of first advancing the batches. This can all be hacked around
with if statements, but my point is that if I am refactoring the state
machine to be clearer, ideally it would actually end up clearer.
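
To make that concrete, here is roughly the shape of the transition I
have in mind (a sketch only; load_next_inner_chunk() and
rewind_outer_batch() are hypothetical helpers, not code from the
patch):

    case HJ_NEED_NEW_INNER_CHUNK:
        /* The outer batch has been fully scanned for the current chunk. */
        if (!load_next_inner_chunk(hashtable))
        {
            /* No chunks left: advance to the next batch as usual. */
            node->hj_JoinState = HJ_NEED_NEW_BATCH;
            break;
        }

        /*
         * For ROJ/FOJ we would first have to emit the finished chunk's
         * unmatched inner tuples -- the wrinkle described above.
         */
        rewind_outer_batch(hashtable);      /* seek outer spill file to start */
        node->hj_JoinState = HJ_NEED_NEW_OUTER;     /* note: no batch advance */
        break;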

A similar problem happens with HJ_FILL_OUTER_TUPLE and the
non-fallback case. For the fallback case, with this implementation,
you must wait until after exhausting the inner side to emit
NULL-extended outer tuples. In the non-fallback case -- a batch which
fits in memory, and, always, batch 0 -- the unmatched outer tuples are
emitted as they are encountered.

As far as I can tell, what makes the most sense in the context of the
state machine is, after exhausting both the outer and inner batch, to
emit NULL-extended inner tuples for that chunk and then emit
NULL-extended outer tuples for that batch.

So, requiring an additional read of the outer side to emit
NULL-extended tuples at the end of the inner batch would slow things
down for the non-fallback case. However, special-casing the fallback
case seems like it would make the state machine much more confusing --
basically like mashing two totally different state machines together.

These questions will probably make a lot more sense with corresponding
code, so I will follow up with the second version of the state machine
patch once I finish it.

--
Melanie Plageman
Attachments
