Re: Avoiding hash join batch explosions with extreme skew and weird stats

Поиск
Список
Период
Сортировка
От Melanie Plageman
Тема Re: Avoiding hash join batch explosions with extreme skew and weird stats
Дата
Msg-id CAAKRu_Z9umj-FNW2ta0yHzhK1LHMBsDFM_CFevHH78=xGiaqBg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Avoiding hash join batch explosions with extreme skew and weirdstats  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Ответы Re: Avoiding hash join batch explosions with extreme skew and weirdstats  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Список pgsql-hackers

On Tue, Jun 23, 2020 at 3:24 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
I started looking at the patch to refresh my knowledge both of this
patch and parallel hash join, but I think it needs a rebase. The
changes in 7897e3bb90 apparently touched some of the code.

Thanks so much for the review, Tomas!

I've attached a rebased patch which also contains updates discussed
below.
 
I assume
you're working on a patch addressing the remaining TODOS, right?

I wanted to get some feedback on the patch before working through the
TODOs to make sure I was on the right track.
Now that you are reviewing this, I will focus all my attention
on addressing your feedback. If there are any TODOs that you feel are
most important, let me know, so I can start with those. Otherwise, I
will prioritize parallel batch 0 spilling.

I wanted to get some feedback on the patch before working through the
TODOs to make sure I was on the right track.

Now that you are reviewing this, I will focus all my attention
on addressing your feedback. If there are any TODOs that you feel are
most important, let me know, so I can start with those.

Otherwise, I will prioritize parallel batch 0 spilling.
David Kimura plans to do a bit of work on on parallel hash join batch 0
spilling tomorrow. Whatever is left after that, I will pick up next
week. Parallel hash join batch 0 spilling is the last large TODO that I
had.

My plan was to then focus on the feedback (either about which TODOs are
most important or outside of the TODOs I've identified) I get from you
and anyone else who reviews this.
 

I see you've switched to "stripe" naming - I find that a bit confusing,
because when I hear stripe I think about RAID, where it means pieces of
data interleaved and stored on different devices. But maybe that's just
me and it's a good name. Maybe it'd be better to keep the naming and
only tweak it at the end, not to disrupt reviews unnecessarily.

I hear you about "stripe". I still quite like it, especially as compared
to its predecessor (originally, I called them chunks -- which is
impossible given that SharedTuplestoreChunks are a thing).

For ease of review, as you mentioned, I will keep the name for now. I am
open to changing it later, though.

I've been soliciting ideas for alternatives and, so far, folks have
suggested "stride", "step", "flock", "herd", "cohort", and "school". I'm
still on team "stripe" though, as it stands.
 

nodeHash.c
----------


1) MultiExecPrivateHash says this

   /*
    * Not subject to skew optimization, so either insert normally
    * or save to batch file if it belongs to another stripe
    */

I wonder what it means to "belong to another stripe". I understand what
that means for batches, which are identified by batchno computed from
the hash value. But I thought "stripes" are just work_mem-sized pieces
of a batch, so I don't quite understand this. Especially when the code
does not actually check "which stripe" the row belongs to.


I agree this was confusing.

"belongs to another stripe" meant here that if batch 0 falls back and we
are still loading it, once we've filled up work_mem, we need to start
saving those tuples to a spill file for batch 0. I've changed the
comment to this:

-        * or save to batch file if it belongs to another stripe
+       * or save to batch file if batch 0 falls back and we have
+       * already filled the hashtable up to space_allowed.


2) I find the fields hashloop_fallback rather confusing. We have one in
HashJoinTable (and it's array of BufFile items) and another one in
ParallelHashJoinBatch (this time just bool).

I think HashJoinTable should be renamed to hashloopBatchFile (similarly
to the other BufFile arrays).

I think you are right about the name. I've changed the name in
HashJoinTableData to hashloopBatchFile.

The array of BufFiles hashloop_fallback was only used by serial
hashjoin. The boolean hashloop_fallback variable is used only by
parallel hashjoin.

The reason I had them named the same thing is that I thought it would be
nice to have a variable with the same name to indicate if a batch "fell
back" for both parallel and serial hashjoin--especially since we check
it in the main hashjoin state machine used by parallel and serial
hashjoin.

In serial hashjoin, the BufFiles aren't identified by name, so I kept
them in that array. In parallel hashjoin, each ParallelHashJoinBatch has
the status saved (in the struct).
So, both represented the fall back status of a batch.

However, I agree with you, so I've renamed the serial one to
hashloopBatchFile.
 
Although I'm not sure why we even need
this file, when we have innerBatchFile? BufFile(s) are not exactly free,
in fact it's one of the problems for hashjoins with many batches.


Interesting -- it didn't even occur to me to combine the bitmap with the
inner side batch file data.
It definitely seems like a good idea to save the BufFile given that so
little data will likely go in it and that it has a 1-1 relationship with
inner side batches.

How might it work? Would you reserve some space at the beginning of the
file? When would you reserve the bytes (before adding tuples you won't
know how many bytes you need, so it might be hard to make sure there is
enough space.) Would all inner side files have space reserved or just
fallback batches?

--
Melanie Plageman
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Jeff Davis
Дата:
Сообщение: Re: Default setting for enable_hashagg_disk
Следующее
От: Bruce Momjian
Дата:
Сообщение: Re: create database with template doesn't copy database ACL