Thread: accounting for memory used for BufFile during hash joins

accounting for memory used for BufFile during hash joins

From: Tomas Vondra
Date:
Hi,

I'm starting this thread mostly to keep track of patches developed in
response to issue [1] reported on pgsql-performance. The symptoms are
very simple - query performing a hash join ends up using much more
memory than expected (pretty much ignoring work_mem), and possibly
ending up with OOM.

The root cause is that hash join treats batches as pretty much free, but
that's not really true - we do allocate two BufFile structs per batch,
and each BufFile is ~8kB as it includes PGAlignedBuffer.

This is not ideal even if we happen to estimate everything correctly,
because for example with work_mem=4MB and nbatch=1024, it means we'll
use about 16MB (2*8kB*1024) for the BufFile structures alone, plus the
work_mem for hash table itself.

But it can easily explode when we under-estimate the hash side. In the
pgsql-performance report, the hash side (with the patches applied,
allowing the query to complete) looks like this:

  Hash (cost=2823846.37..2823846.37 rows=34619 width=930)
       (actual time=252946.367..252946.367 rows=113478127 loops=1)

So it's 3277x under-estimated. It starts with 16 batches, and ends up
adding more and more batches until it fails with 524288 of them (it gets
to that many batches because some of the values are very common and we
don't disable the growth earlier).

The OOM is not very surprising, because with 524288 batches it'd need
about 8GB of memory, and the system only has 8GB RAM installed.
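For anyone who wants to check the arithmetic, here is a throwaway
calculation of the BufFile overhead alone (it only assumes ~8kB per
BufFile and two files per batch, as described above):

    #include <stdio.h>

    /* throwaway illustration of the numbers above: ~8kB per BufFile,
     * two BufFiles (inner + outer) per batch */
    int main(void)
    {
        long long buffile_size = 8 * 1024;

        printf("nbatch=1024:   %lld MB\n",
               2 * buffile_size * 1024 / (1024 * 1024));
        printf("nbatch=524288: %lld GB\n",
               2 * buffile_size * 524288 / (1024LL * 1024 * 1024));
        return 0;
    }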

The two attached patches both account for the BufFile memory, but then
use very different strategies when the work_mem limit is reached.

The first patch recognizes that it's impossible to keep adding batches
without breaking the work_mem limit, because at some point the BufFile
structs alone will need more memory than that. But it does not make sense
to stop adding batches entirely either, because then the hash table could
grow indefinitely.

So the patch abandons the idea of enforcing work_mem in this situation,
and instead attempts to minimize memory usage over time - it increases
the spaceAllowed in a way that ensures doubling the number of batches
actually reduces memory usage in the long run.
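Just to illustrate the shape of that rule, here is a minimal sketch of the
general idea - this is not the actual patch code, and the names and the
exact condition are illustrative only:

    #include <stdbool.h>
    #include <stdio.h>

    /* Sketch only: doubling nbatch adds 2 * nbatch extra BufFiles (~8kB each),
     * so it only pays off if it can free more memory than that. */
    static bool
    doubling_nbatch_saves_memory(long long hashtable_bytes, int nbatch)
    {
        long long extra_buffiles = 2LL * nbatch * 8192;

        /* splitting can move out at most ~half of the current hash table */
        return hashtable_bytes / 2 > extra_buffiles;
    }

    int main(void)
    {
        /* 4MB hash table, 16 batches: splitting still helps */
        printf("%d\n", doubling_nbatch_saves_memory(4LL * 1024 * 1024, 16));
        /* 4MB hash table, 4096 batches: the new BufFiles cost more than we save */
        printf("%d\n", doubling_nbatch_saves_memory(4LL * 1024 * 1024, 4096));
        return 0;
    }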

The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
These overflow files can't be counted into work_mem, but we should need
only very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x fewer overflow files (compared to per-batch).
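To make the mechanics a bit more concrete, here is a toy model of how
tuples would be routed under such a scheme - purely illustrative, the
names and numbers are not taken from the patch:

    #include <stdio.h>

    /* Toy model of the slicing idea: batches are grouped into slices of
     * nbatch_slice batches; only the current slice has per-batch files open,
     * everything else goes into one overflow file per slice. */
    int main(void)
    {
        int nbatch_slice = 128;   /* e.g. work_mem=4MB */
        int curbatch = 10;
        int curslice = curbatch / nbatch_slice;

        for (int batchno = 0; batchno < 1024; batchno += 250)
        {
            int slice = batchno / nbatch_slice;

            if (slice == curslice)
                printf("batch %4d -> its own per-batch file\n", batchno);
            else
                printf("batch %4d -> overflow file of slice %d\n", batchno, slice);
        }
        return 0;
    }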


Neither of those patches tweaks ExecChooseHashTableSize() to consider
memory needed for BufFiles while deciding how many batches will be
needed. That's something that probably needs to happen, but it would not
help with the underestimate issue.

I'm not entirely sure which of those approaches is the right one. The
first one is clearly just "damage control" for cases where the hash
side turned out to be much larger than we expected. With good estimates
we probably would not have picked a hash join for those cases (that is, we
should have realized we can't stay within work_mem and avoided the hash join).

The second patch however makes hash join viable for some of those cases,
and it seems to work pretty well (there are some numbers in the message
posted to the pgsql-performance thread). So I kinda like this second one.

It's all just PoC quality at this point, far from a committable state.


[1] https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: accounting for memory used for BufFile during hash joins

From: Melanie Plageman
Date:


On Fri, May 3, 2019 at 5:34 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
These extra slices can't be counted into work_mem, but we should need
just very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x less overflow files (compared to per-batch).

I want to see if I understand the implications of the per-slice-overflow patch
for execution of hashjoin:
For each bucket in the hashtable, when attempting to double the number of
batches, if the memory that the BufFile structs will occupy once this is done
will exceed the work_mem, split each batch into slices that fit into memory.
This means that, for each probe-side tuple hashing to that bucket, you have to
load every slice of each batch separately into memory to ensure correct results.
Is this right?
 

I'm not entirely sure which of those approaches is the right one. The
first one is clearly just a "damage control" for cases where the hash
side turned out to be much larger than we expected. With good estimates
we probably would not have picked a hash join for those (that is, we
should have realized we can't keep work_mem and prohibit hash join).

The second patch however makes hash join viable for some of those cases,
and it seems to work pretty well (there are some numbers in the message
posted to pgsql-performance thread). So I kinda like this second one.

So, my initial reaction after taking a look at the patches is that I prefer the
first approach--increasing the resize threshold. The second patch, the
per-slice-overflow patch, adds a major new mechanic to hashjoin in order to
address what is, based on my understanding, an edge case.

--
Melanie Plageman

Re: accounting for memory used for BufFile during hash joins

From: Thomas Munro
Date:
On Tue, May 7, 2019 at 9:58 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Fri, May 3, 2019 at 5:34 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> The second patch tries to enforce work_mem more strictly. That would be
>> impossible if we were to keep all the BufFile structs in memory, so
>> instead it slices the batches into chunks that fit into work_mem, and
>> then uses a single "overflow" file for slices currently not in memory.
>> These extra slices can't be counted into work_mem, but we should need
>> just very few of them. For example with work_mem=4MB the slice is 128
>> batches, so we need 128x less overflow files (compared to per-batch).
>>
> I want to see if I understand the implications of the per-slice-overflow patch
> for execution of hashjoin:
> For each bucket in the hashtable, when attempting to double the number of
> batches, if the memory that the BufFile structs will occupy once this is done
> will exceed the work_mem, split each batch into slices that fit into memory.
> This means that, for each probe-side tuple hashing to that bucket, you have to
> load every slice of each batch separately into memory to ensure correct results.
> Is this right?

Seems expensive for large numbers of slices -- you need to join the
outer batch against each inner slice.  But I wonder how we'd deal with
outer joins, as Tom Lane asked in another thread:

https://www.postgresql.org/message-id/12185.1488932980%40sss.pgh.pa.us

>> I'm not entirely sure which of those approaches is the right one. The
>> first one is clearly just a "damage control" for cases where the hash
>> side turned out to be much larger than we expected. With good estimates
>> we probably would not have picked a hash join for those (that is, we
>> should have realized we can't keep work_mem and prohibit hash join).
>>
>> The second patch however makes hash join viable for some of those cases,
>> and it seems to work pretty well (there are some numbers in the message
>> posted to pgsql-performance thread). So I kinda like this second one.
>>
> So, my initial reaction after taking a look at the patches is that I prefer the
>first approach--increasing the resize threshold. The second patch, the
> per-slice-overflow patch, adds a major new mechanic to hashjoin in order to
> address what is, based on my understanding, an edge case.

Personally I'd like to make work_mem more reliable, even if it takes a
major new mechanism.

Stepping back a bit, I think there is something fishy about the way we
detect extreme skew.  Is that a factor in this case?  Right now we
wait until we have a batch that gets split into child batches
containing exactly 0% and 100% of the tuples before we give up.
Previously I had thought of that as merely a waste of time, but
clearly it's also a waste of unmetered memory.  Oops.

I think our extreme skew detector should go off sooner, because
otherwise if you have N nicely distributed unique keys and also M
duplicates of one bad egg key that'll never fit in memory, we keep
repartitioning until none of the N keys fall into the batch containing
the key for the M duplicates before we give up!  You can use
balls-into-bins maths to figure out the number, but I think that means
we expect to keep splitting until we have N * some_constant batches,
and that's just silly and liable to create massive numbers of
partitions proportional to N, even though we're trying to solve a
problem with M.  In another thread I suggested we should stop when
(say) 95% of the tuples go to one child batch.  I'm not sure how you
pick the number.
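A sketch of what such a relaxed check could look like - the 95% threshold
and the names here are illustrative, not a concrete code proposal:

    #include <stdbool.h>
    #include <stdio.h>

    /* Today growth is disabled only when one child batch receives 100% of the
     * tuples; this version gives up once it receives e.g. 95% of them. */
    static bool
    batch_split_is_futile(long parent_tuples, long biggest_child_tuples)
    {
        return biggest_child_tuples >= 0.95 * parent_tuples;
    }

    int main(void)
    {
        printf("%d\n", batch_split_is_futile(100000, 99000));  /* 1: give up    */
        printf("%d\n", batch_split_is_futile(100000, 60000));  /* 0: keep going */
        return 0;
    }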

Of course that doesn't solve the problem that we don't have a better
plan for dealing with the M duplicates -- it just avoids needless
batch explosions triggered by bad maths.  I think we need something
like Tomas's #2, or a way to switch to sort-merge, or some other
scheme.  I'm not sure how to compare the slice idea, which involves
processing outer tuples * inner slices, with the sort-merge idea, which
involves sorting the inner and outer batch, plus the entirely new
concept of switching to another node at execution time.

I also wondered about reducing the buffer size of the BufFiles, but
that doesn't seem to be fixing the real problem.

-- 
Thomas Munro
https://enterprisedb.com



Re: accounting for memory used for BufFile during hash joins

From: Tomas Vondra
Date:
On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
>On Tue, May 7, 2019 at 9:58 AM Melanie Plageman
><melanieplageman@gmail.com> wrote:
>> On Fri, May 3, 2019 at 5:34 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>> The second patch tries to enforce work_mem more strictly. That would be
>>> impossible if we were to keep all the BufFile structs in memory, so
>>> instead it slices the batches into chunks that fit into work_mem, and
>>> then uses a single "overflow" file for slices currently not in memory.
>>> These extra slices can't be counted into work_mem, but we should need
>>> just very few of them. For example with work_mem=4MB the slice is 128
>>> batches, so we need 128x less overflow files (compared to per-batch).
>>>
>> I want to see if I understand the implications of the per-slice-overflow patch
>> for execution of hashjoin:
>> For each bucket in the hashtable, when attempting to double the number of
>> batches, if the memory that the BufFile structs will occupy once this is done
>> will exceed the work_mem, split each batch into slices that fit into memory.
>> This means that, for each probe-side tuple hashing to that bucket, you have to
>> load every slice of each batch separately into memory to ensure correct results.
>> Is this right?
>

>Seems expensive for large numbers of slices -- you need to join the
>outer batch against each inner slice.

Nope, that's not how it works. It's the array of batches that gets
sliced, not the batches themselves.

It does slightly increase the amount of data we need to shuffle between
the temp files, because we can't write the data directly to batches in
"future" slices. But that amplification is capped to ~2.2x (compared to
the ~1.4x in master) - I've shared some measurements in [1].

[1] https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development

>But I wonder how we'd deal with outer joins, as Tom Lane asked in
>another thread:
>
>https://www.postgresql.org/message-id/12185.1488932980%40sss.pgh.pa.us
>

That seems unrelated - we slice the array of batches, to keep memory
needed for BufFile under control. The hash table remains intact, so
there's no issue with outer joins.

>>> I'm not entirely sure which of those approaches is the right one. The
>>> first one is clearly just a "damage control" for cases where the hash
>>> side turned out to be much larger than we expected. With good estimates
>>> we probably would not have picked a hash join for those (that is, we
>>> should have realized we can't keep work_mem and prohibit hash join).
>>>
>>> The second patch however makes hash join viable for some of those cases,
>>> and it seems to work pretty well (there are some numbers in the message
>>> posted to pgsql-performance thread). So I kinda like this second one.
>>>
>> So, my initial reaction after taking a look at the patches is that I prefer the
>> first approach--increasing the resize threshold. The second patch, the
>> per-slice-overflow patch, adds a major new mechanic to hashjoin in order to
>> address what is, based on my understanding, an edge case.
>
>Personally I'd like to make work_mem more reliable, even if it takes a
>major new mechanism.
>

Yeah, I share that attitude.

>Stepping back a bit, I think there is something fishy about the way we
>detect extreme skew.  Is that a factor in this case?  Right now we
>wait until we have a batch that gets split into child batches
>containing exactly 0% and 100% of the tuples before we give up.
>Previously I had thought of that as merely a waste of time, but
>clearly it's also a waste of unmetered memory.  Oops.
>

Yes, that was a factor in the reported query - the data set contained a
significant number of duplicate values (~10%) but it took a while to
disable growth because there always happened to be a couple of rows with a
different value.

>I think our extreme skew detector should go off sooner, because
>otherwise if you have N nicely distributed unique keys and also M
>duplicates of one bad egg key that'll never fit in memory, we keep
>repartitioning until none of the N keys fall into the batch containing
>the key for the M duplicates before we give up!  You can use
>balls-into-bins maths to figure out the number, but I think that means
>we expect to keep splitting until we have N * some_constant batches,
>and that's just silly and liable to create massive numbers of
>partitions proportional to N, even though we're trying to solve a
>problem with M.  In another thread I suggested we should stop when
>(say) 95% of the tuples go to one child batch.  I'm not sure how you
>pick the number.
>

I agree we should relax the 0%/100% split condition, and disable the
growth sooner. But I think we should also re-evaluate that decision
after a while - the data set may be correlated in some way, in which
case we may disable the growth prematurely. It may not reduce memory
usage now, but it may help in the future.

It's already an issue, but it would be even more likely if we disabled
growth e.g. with just 5%/95% splits.

FWIW I believe this is a mostly orthogonal issue to what's discussed in
this thread.

>Of course that doesn't solve the problem that we don't have a better
>plan for dealing with the M duplicates -- it just avoids a needless
>batch explosions triggered by bad maths.  I think we need something
>like Tomas's #2, or a way to switch to sort-merge, or some other
>scheme.  I'm not sure how to compare the slice idea, which involves
>processing outer tuples * inner slices with the sort-merge idea, which
>involves sorting the inner and outer batch, plus the entirely new
>concept of switching to another node at execution time.
>

Do we actually check how many duplicates are there during planning? I
wonder if we could penalize (or even disable) hashjoins when there are
too many duplicates to fit into work_mem. Of course, that's going to be
tricky with filtering, and so on.

Switching to some other algorithm during execution moves the goal posts
to the next galaxy, I'm afraid.

>I also wondered about reducing the buffer size of the BufFiles, but
>that doesn't seem to be fixing the real problem.
>

Yeah. It might help a bit, but it's very limited - even if you reduce
the buffer to say 1kB, it's just a factor of 8. And I'm not sure what
would be the impact on performance. 


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: accounting for memory used for BufFile during hash joins

From: Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Do we actually check how many duplicates are there during planning?

Certainly that's part of the planner's cost estimates ... but it's
only as good as the planner's statistical knowledge.

            regards, tom lane



Re: accounting for memory used for BufFile during hash joins

From: Thomas Munro
Date:
On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
> >Seems expensive for large numbers of slices -- you need to join the
> >outer batch against each inner slice.
>
> Nope, that's not how it works. It's the array of batches that gets
> sliced, not the batches themselves.

Sorry, I read only the description and not the code, and got confused
about that.  So, I see three separate but related problems:

A.  Broken escape valve:  sometimes we generate a huge number of
batches while trying to split up many duplicates, because of the
presence of other more uniformly distributed keys.  We could fix that
with (say) a 95% rule.
B.  Lack of good alternative execution strategy when the escape valve
is triggered.  A batch cannot be split effectively, but cannot fit in
work_mem, so for now we decide to ignore work_mem.
C.  Unmetered explosion of batches and thus BufFiles, probably usually
caused by problem A, but theoretically also due to a real need for
partitions.

> >But I wonder how we'd deal with outer joins, as Tom Lane asked in
> >another thread:
> >
> >https://www.postgresql.org/message-id/12185.1488932980%40sss.pgh.pa.us
>
> That seems unrelated - we slice the array of batches, to keep memory
> needed for BufFile under control. The hash table remains intact, so
> there's no issue with outer joins.

Right, sorry, my confusion.  I thought you were describing
https://en.wikipedia.org/wiki/Block_nested_loop.  (I actually think we
can make that work for left outer joins without too much fuss by
writing out a stream of match bits to a new temporary file.  Googling,
I see that MySQL originally didn't support BNL for outer joins and
then added some match flag propagation thing recently.)

> I agree we should relax the 0%/100% split condition, and disable the
> growth sooner. But I think we should also re-evaluate that decision
> after a while - the data set may be correlated in some way, in which
> case we may disable the growth prematurely. It may not reduce memory
> usage now, but it may help in the future.
>
> It's already an issue, but it would be even more likely if we disabled
> growth e.g. with just 5%/95% splits.
>
> FWIW I believe this is mostly orthogonal issue to what's discussed in
> this thread.

But isn't problem A the root cause of problem C, in most cases?  There
must also be "genuine" cases of problem C that would occur even if we
fix that, of course: someone has small work_mem, and data that can be
effectively partitioned to fit it, but it just takes a huge number of
partitions to do it.  So that we don't behave badly in those cases, I
agree with you 100%: we should fix the memory accounting to count
BufFile overheads as you are proposing, and then I guess ideally
switch to our alternative strategy (BNL or sort-merge or ...) when we
see that BufFiles are wasting too much work_mem and it's time to try
something else.  It seems you don't actually have one of those cases
here, though?

I think we should fix problem A.  Then handle problem C by accounting
for BufFiles, and figure out a way to switch to our alternative
strategy (currently: ignore work_mem), when we think that creating
more BufFiles will be futile (not sure exactly what the rule there
should be).  And then work on fixing B properly with a good strategy.
Here's a straw-man idea: we could adopt BNL, and then entirely remove
our repartitioning code.  If the planner's number of partitions turns
out to be not enough, we'll just handle it using BNL loops.

> Switching to some other algorithm during execution moves the goal posts
> to the next galaxy, I'm afraid.

The main problem I'm aware of with sort-merge join is: not all that is
hashable is sortable.  So BNL is actually the only solution I'm aware
of for problem B that doesn't involve changing a fundamental thing
about PostgreSQL's data type requirements.

-- 
Thomas Munro
https://enterprisedb.com



Re: accounting for memory used for BufFile during hash joins

From: Tomas Vondra
Date:
On Mon, May 06, 2019 at 11:18:28PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> Do we actually check how many duplicates are there during planning?
>
>Certainly that's part of the planner's cost estimates ... but it's
>only as good as the planner's statistical knowledge.
>

I'm looking at the code, and the only place where I see code dealing with
MCVs (probably the best place for info about duplicate values) is
estimate_hash_bucketsize in final_cost_hashjoin. That's not quite what I
had in mind - I was thinking more about something along the lines of "See the
largest group of duplicate values, disable hash join if it can't fit into
work_mem at all."

Of course, if the input estimates are off, that may not work too well. It
would certainly not help the query failing with OOM, because that was a
case of severe underestimation.

Or did you mean some other piece of code that I have missed?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: accounting for memory used for BufFile during hash joins

From: Tomas Vondra
Date:
On Tue, May 07, 2019 at 04:28:36PM +1200, Thomas Munro wrote:
>On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
>> >Seems expensive for large numbers of slices -- you need to join the
>> >outer batch against each inner slice.
>>
>> Nope, that's not how it works. It's the array of batches that gets
>> sliced, not the batches themselves.
>
>Sorry, I read only the description and not the code, and got confused
>about that.  So, I see three separate but related problems:
>
>A.  Broken escape valve:  sometimes we generate a huge number of
>batches while trying to split up many duplicates, because of the
>presence of other more uniformly distributed keys.  We could fix that
>with (say) a 95% rule.
>B.  Lack of good alternative execution strategy when the escape valve
>is triggered.  A batch cannot be split effectively, but cannot fit in
>work_mem, so for now we decide to ignore work_mem.
>C.  Unmetered explosion of batches and thus BufFiles, probably usually
>caused by problem A, but theoretically also due to a real need for
>partitions.
>

Right. I don't think a single solution addressing all those issues exists.
It's more likely we need multiple improvements.

>> >But I wonder how we'd deal with outer joins, as Tom Lane asked in
>> >another thread:
>> >
>> >https://www.postgresql.org/message-id/12185.1488932980%40sss.pgh.pa.us
>>
>> That seems unrelated - we slice the array of batches, to keep memory
>> needed for BufFile under control. The hash table remains intact, so
>> there's no issue with outer joins.
>
>Right, sorry, my confusion.  I thought you were describing
>https://en.wikipedia.org/wiki/Block_nested_loop.  (I actually think we
>can make that work for left outer joins without too much fuss by
>writing out a stream of match bits to a new temporary file.  Googling,
>I see that MySQL originally didn't support BNL for outer joins and
>then added some match flag propagation thing recently.)
>

Possibly, I'm not against implementing that, although I don't have a very
good idea what the benefits of BNL joins are (performance-wise). In any
case, I think it's entirely unrelated to hash joins.

>> I agree we should relax the 0%/100% split condition, and disable the
>> growth sooner. But I think we should also re-evaluate that decision
>> after a while - the data set may be correlated in some way, in which
>> case we may disable the growth prematurely. It may not reduce memory
>> usage now, but it may help in the future.
>>
>> It's already an issue, but it would be even more likely if we disabled
>> growth e.g. with just 5%/95% splits.
>>
>> FWIW I believe this is mostly orthogonal issue to what's discussed in
>> this thread.
>
>But isn't problem A the root cause of problem C, in most cases?  There
>must also be "genuine" cases of problem C that would occur even if we
>fix that, of course: someone has small work_mem, and data that can be
>effectively partitioned to fit it, but it just takes a huge number of
>partitions to do it.  So that we don't behave badly in those cases, I
>agree with you 100%: we should fix the memory accounting to count
>BufFile overheads as you are proposing, and then I guess ideally
>switch to our alternative strategy (BNL or sort-merge or ...) when we
>see that BufFiles are wasting too much work_mem and it's time to try
>something else.  It seems you don't actually have one of those cases
>here, though?
>

Maybe. Or maybe not. I don't have enough data to make such judgements
about the causes in general. We have one query from pgsql-performance.
There might be more, but IMO that's probably a biased data set.

But even that reported query is actually not a case of A causing C.
The outer side of the hash join was significantly underestimated (34619
vs. 113478127) due to highly-correlated conditions.

And in that case it's trivial to cause nbatch explosion even with perfect
data sets with no duplicates (so no escape valve failure).


>I think we should fix problem A.  Then handle problem C by accounting
>for BufFiles, and figure out a way to switch to our alternative
>strategy (currently: ignore work_mem), when we think that creating
>more BufFiles will be futile (not sure exactly what the rule there
>should be).  And then work on fixing B properly with a good strategy.
>Here's a straw-man idea: we could adopt BNL, and then entirely remove
>our repartitioning code.  If the planner's number of partitions turns
>out to be not enough, we'll just handle it using BNL loops.
>

Yeah, something like that.

I think we can fix A by relaxing the escape valve condition, and then
rechecking it once in a while. So we fill work_mem, realize it didn't
actually reduce the batch size significantly and disable nbatch growth.
But at the same time we increase the threshold to 2x work_mem, and after
reaching it we "consider" a nbatch increase.  That is, we walk the batch
and see how many tuples would move if we increased nbatch (that should be
fairly cheap) - if it helps, great, enable growth and split the batch. If
not, double the threshold again.  Rinse and repeat.
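A rough sketch of that loop, with all names and thresholds being
illustrative rather than anything from the patches:

    #include <stdbool.h>
    #include <stdio.h>

    /* Sketch: after disabling growth, re-check it whenever memory use reaches
     * the (relaxed) threshold; re-enable growth only if doubling nbatch would
     * actually move a meaningful share of the tuples. */
    static bool
    reconsider_nbatch_growth(long ntuples, long ntuples_that_would_move,
                             long long *space_threshold)
    {
        if (ntuples_that_would_move > ntuples / 4)   /* arbitrary cutoff */
            return true;            /* enable growth and split the batch   */

        *space_threshold *= 2;      /* otherwise tolerate more memory and  */
        return false;               /* look again at the next threshold    */
    }

    int main(void)
    {
        long long threshold = 8LL * 1024 * 1024;    /* 2x work_mem=4MB */
        bool grow;

        grow = reconsider_nbatch_growth(100000, 50000, &threshold);
        printf("%d (threshold now %lld)\n", grow, threshold);

        grow = reconsider_nbatch_growth(100000, 100, &threshold);
        printf("%d (threshold now %lld)\n", grow, threshold);
        return 0;
    }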

For C, I think we can use either of the two approaches I proposed. I like
the second option better, as it actually enforces work_mem. The first
option kinda helped with A too, although in a different way, and I think the
solution I outlined in the previous paragraph will work better.

No opinion regarding the switch to BNL, at the moment.

>> Switching to some other algorithm during execution moves the goal posts
>> to the next galaxy, I'm afraid.
>
>The main problem I'm aware of with sort-merge join is: not all that is
>hashable is sortable.  So BNL is actually the only solution I'm aware
>of for problem B that doesn't involve changing a fundamental thing
>about PostgreSQL's data type requirements.
>

Sure, each of those algorithms has limitations. But I think that's mostly
irrelevant to the main issue - switching between algorithms mid-execution.
At that point some of the tuples might have already been sent to the
other nodes, and I have no idea how to "resume" the tuple stream short of
buffering everything locally until the join completes. And that would be
rather terrible, I guess.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: accounting for memory used for BufFile during hash joins

From: Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Mon, May 06, 2019 at 11:18:28PM -0400, Tom Lane wrote:
>> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>> Do we actually check how many duplicates are there during planning?

>> Certainly that's part of the planner's cost estimates ... but it's
>> only as good as the planner's statistical knowledge.

> I'm looking at the code, and the only place where I see code dealing with
> MCVs (probably the best place for info about duplicate values) is
> estimate_hash_bucketsize in final_cost_hashjoin.

What I'm thinking of is this bit in final_cost_hashjoin:

    /*
     * If the bucket holding the inner MCV would exceed work_mem, we don't
     * want to hash unless there is really no other alternative, so apply
     * disable_cost.  (The executor normally copes with excessive memory usage
     * by splitting batches, but obviously it cannot separate equal values
     * that way, so it will be unable to drive the batch size below work_mem
     * when this is true.)
     */
    if (relation_byte_size(clamp_row_est(inner_path_rows * innermcvfreq),
                           inner_path->pathtarget->width) >
        (work_mem * 1024L))
        startup_cost += disable_cost;

It's certainly likely that that logic needs improvement in view of this
discussion --- I was just pushing back on the claim that we weren't
considering the issue at all.

            regards, tom lane



Re: accounting for memory used for BufFile during hash joins

From: Tomas Vondra
Date:
On Tue, May 07, 2019 at 10:42:36AM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Mon, May 06, 2019 at 11:18:28PM -0400, Tom Lane wrote:
>>> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>>> Do we actually check how many duplicates are there during planning?
>
>>> Certainly that's part of the planner's cost estimates ... but it's
>>> only as good as the planner's statistical knowledge.
>
>> I'm looking at the code, and the only place where I see code dealing with
>> MCVs (probably the best place for info about duplicate values) is
>> estimate_hash_bucketsize in final_cost_hashjoin.
>
>What I'm thinking of is this bit in final_cost_hashjoin:
>
>    /*
>     * If the bucket holding the inner MCV would exceed work_mem, we don't
>     * want to hash unless there is really no other alternative, so apply
>     * disable_cost.  (The executor normally copes with excessive memory usage
>     * by splitting batches, but obviously it cannot separate equal values
>     * that way, so it will be unable to drive the batch size below work_mem
>     * when this is true.)
>     */
>    if (relation_byte_size(clamp_row_est(inner_path_rows * innermcvfreq),
>                           inner_path->pathtarget->width) >
>        (work_mem * 1024L))
>        startup_cost += disable_cost;
>
>It's certainly likely that that logic needs improvement in view of this
>discussion --- I was just pushing back on the claim that we weren't
>considering the issue at all.
>

Ah, this code is new in 11, and I was looking at code from 10 for some
reason. I don't think we can do much better than this, except perhaps
falling back to (1/ndistinct) when there's no MCV available.
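For what it's worth, a sketch of that fallback - this is not how
final_cost_hashjoin works today, and the names are illustrative:

    #include <stdbool.h>
    #include <stdio.h>

    /* If there's no MCV list, assume the most common value has a frequency
     * of roughly 1/ndistinct instead of skipping the check entirely. */
    static double
    assumed_max_mcv_freq(bool have_mcv, double mcv_freq, double ndistinct)
    {
        if (have_mcv)
            return mcv_freq;
        return (ndistinct > 0) ? 1.0 / ndistinct : 1.0;
    }

    int main(void)
    {
        printf("%f\n", assumed_max_mcv_freq(true, 0.10, 1000));   /* 0.100000 */
        printf("%f\n", assumed_max_mcv_freq(false, 0.0, 1000));   /* 0.001000 */
        return 0;
    }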


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: accounting for memory used for BufFile during hash joins

From: Melanie Plageman
Date:


On Mon, May 6, 2019 at 8:15 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Nope, that's not how it works. It's the array of batches that gets
sliced, not the batches themselves.

It does slightly increase the amount of data we need to shuffle between
the temp files, because we can't write the data directly to batches in
"future" slices. But that amplification is capped to ~2.2x (compared to
the ~1.4x in master) - I've shared some measurements in [1].

[1] https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development


Cool, I misunderstood. I looked at the code again today, and at the email
thread where you measured "amplification".

In terms of how many times you write each tuple, is it accurate to say that a
tuple can now be spilled three times (in the worst case) whereas, before, it
could be spilled only twice?

1 - when building the inner side hashtable, tuple is spilled to a "slice" file
2 - (assuming the number of batches was increased) during execution, when a
tuple belonging to a later slice's spill file is found, it is re-spilled to that
slice's spill file
3 - during execution, when reading from its slice file, it is re-spilled (again)
to its batch's spill file

Is it correct that the max number of BufFile structs you will have is equal to
the number of slices + number of batches in a slice
because that is the max number of open BufFiles you would have at a time?

By the way, applying the v4 patch on master, in an assert build, I am tripping
some asserts -- starting with
Assert(!file->readOnly);
in BufFileWrite

One thing I was a little confused by was the nbatch_inmemory member of the
hashtable.  The comment in ExecChooseHashTableSize says that it is determining
the number of batches we can fit in memory.  I thought that the problem was the
amount of space taken up by the BufFile data structure itself--which is related
to the number of open BufFiles you need at a time. This comment in
ExecChooseHashTableSize makes it sound like you are talking about fitting more
than one batch of tuples into memory at a time. I was under the impression that
you could only fit one batch of tuples in memory at a time.

So, I was stepping through the code with work_mem set to the lower bound, and in
ExecHashIncreaseNumBatches I got confused.
hashtable->nbatch_inmemory was 2 for me, thus nbatch_tmp was 2,
so I didn't meet this condition
if (nbatch_tmp > hashtable->nbatch_inmemory)
since I had just set nbatch_tmp using hashtable->nbatch_inmemory.
So, I didn't increase the number of slices, which is what I was expecting.
What happens when hashtable->nbatch_inmemory is equal to nbatch_tmp?

--
Melanie Plageman

Re: accounting for memory used for BufFile during hash joins

From: Melanie Plageman
Date:

On Tue, May 7, 2019 at 6:59 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Tue, May 07, 2019 at 04:28:36PM +1200, Thomas Munro wrote:
>On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
>> Switching to some other algorithm during execution moves the goal posts
>> to the next galaxy, I'm afraid.
>
>The main problem I'm aware of with sort-merge join is: not all that is
>hashable is sortable.  So BNL is actually the only solution I'm aware
>of for problem B that doesn't involve changing a fundamental thing
>about PostgreSQL's data type requirements.
>

Sure, each of those algorithms has limitations. But I think that's mostly
irrelevant to the main issue - switching between algorithms mid-execution.
>     At that point some of the tuples might have been already sent to
other nodes, and I have no idea how to "resume" the tuple stream short of
buffering everything locally until the join completes. And that would be
rather terrible, I guess.


What if you switched to NLJ on a batch-by-batch basis and did it before starting
execution of the join but after building the inner side of the hash table? That
way, no tuples will have been sent to other nodes yet.

--
Melanie Plageman

Re: accounting for memory used for BufFile during hash joins

From: Tomas Vondra
Date:
On Tue, May 07, 2019 at 05:43:56PM -0700, Melanie Plageman wrote:
>   On Tue, May 7, 2019 at 6:59 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
>   wrote:
>
>     On Tue, May 07, 2019 at 04:28:36PM +1200, Thomas Munro wrote:
>     >On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
>     ><tomas.vondra@2ndquadrant.com> wrote:
>     >> On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
>     >> Switching to some other algorithm during execution moves the goal
>     posts
>     >> to the next galaxy, I'm afraid.
>     >
>     >The main problem I'm aware of with sort-merge join is: not all that is
>     >hashable is sortable.  So BNL is actually the only solution I'm aware
>     >of for problem B that doesn't involve changing a fundamental thing
>     >about PostgreSQL's data type requirements.
>     >
>
>     Sure, each of those algorithms has limitations. But I think that's
>     mostly
>     irrelevant to the main issue - switching between algorithms
>     mid-execution.
>     At that point some of the tuples might have been already sent to
>     the
>     other nodes, and I have no idea how to "resume" the tuple stream short
>     of
>     buffering everything locally until the join completes. And that would be
>     rather terrible, I guess.
>
>   What if you switched to NLJ on a batch-by-batch basis and did it before
>   starting
>   execution of the join but after building the inner side of the hash
>   table.  That
>   way, no tuples will have been sent to other nodes yet.
>

Interesting idea! I think you're right that doing it on a per-batch basis
would solve that problem. Essentially, if all (or >95%) of the tuples
have the same hash value, we could switch to a special "degraded" mode
doing something like a NL. At that point the hash table benefits are
lost anyway, because all the tuples are in a single chain, so it's not
going to be much slower.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: accounting for memory used for BufFile during hash joins

From: Tomas Vondra
Date:
On Tue, May 07, 2019 at 05:30:27PM -0700, Melanie Plageman wrote:
>   On Mon, May 6, 2019 at 8:15 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
>   wrote:
>
>     Nope, that's not how it works. It's the array of batches that gets
>     sliced, not the batches themselves.
>
>     It does slightly increase the amount of data we need to shuffle between
>     the temp files, because we can't write the data directly to batches in
>     "future" slices. But that amplification is capped to ~2.2x (compared to
>     the ~1.4x in master) - I've shared some measurements in [1].
>
>     [1]
>     https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
>
>   Cool, I misunderstood. I looked at the code again today, and, at the email
>   thread where you measured "amplification".
>

Oh! I hope you're not too disgusted by the code in that PoC patch ;-)

>   In terms of how many times you write each tuple, is it accurate to
>   say that a tuple can now be spilled three times (in the worst case)
>   whereas, before, it could be spilled only twice?
>
>   1 - when building the inner side hashtable, tuple is spilled to a "slice"
>   file
>   2 - (assuming the number of batches was increased) during execution, when
>   a tuple belonging to a later slice's spill file is found, it is re-spilled
>   to that slice's spill file
>   3 - during execution, when reading from its slice file, it is re-spilled
>   (again) to its batch's spill file
>

Yes, that's a mostly accurate understanding. Essentially this might add
one extra step of "reshuffling" from the per-slice to the per-batch files.

>   Is it correct that the max number of BufFile structs you will have
>   is equal to the number of slices + number of batches in a slice
>   because that is the max number of open BufFiles you would have at a
>   time?

Yes. With the caveat that we need twice that number of BufFile structs,
because we need them on both sides of the join.
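As a concrete illustration of that bound (the numbers below are made up,
but the formula is just slices plus batches per slice, doubled for the
two sides):

    #include <stdio.h>

    int main(void)
    {
        int nbatch = 1024;
        int nbatch_slice = 128;     /* e.g. work_mem=4MB */
        int nslices = nbatch / nbatch_slice;

        /* per-batch files of the current slice + one overflow file per slice,
         * and all of that for both the inner and the outer side */
        int max_buffiles = 2 * (nslices + nbatch_slice);

        printf("at most %d BufFiles open (~%d kB)\n", max_buffiles, max_buffiles * 8);
        return 0;
    }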

>   By the way, applying v4 patch on master, in an assert build, I am tripping
>   some
>   asserts -- starting with
>   Assert(!file->readOnly);
>   in BufFileWrite

Whoooops :-/

>   One thing I was a little confused by was the nbatch_inmemory member
>   of the hashtable.  The comment in ExecChooseHashTableSize says that
>   it is determining the number of batches we can fit in memory.  I
>   thought that the problem was the amount of space taken up by the
>   BufFile data structure itself--which is related to the number of
>   open BufFiles you need at a time. This comment in
>   ExecChooseHashTableSize makes it sound like you are talking about
>   fitting more than one batch of tuples into memory at a time. I was
>   under the impression that you could only fit one batch of tuples in
>   memory at a time.

I suppose you mean this chunk:

    /*
     * See how many batches we can fit into memory (driven mostly by size
     * of BufFile, with PGAlignedBlock being the largest part of that).
     * We need one BufFile for inner and outer side, so we count it twice
     * for each batch, and we stop once we exceed (work_mem/2).
     */
    while ((nbatch_inmemory * 2) * sizeof(PGAlignedBlock) * 2
           <= (work_mem * 1024L / 2))
        nbatch_inmemory *= 2;

Yeah, that comment is a bit confusing. What the code actually does is
computing the largest "slice" of batches for which we can keep the
BufFile structs in memory, without exceeding work_mem/2.
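For example, plugging work_mem=4MB into that loop (assuming it starts at 1,
with sizeof(PGAlignedBlock) = 8kB) gives a slice of 128 batches, matching
the number from the first message in this thread:

    #include <stdio.h>

    int main(void)
    {
        long work_mem = 4096;        /* kB, like the GUC */
        long block_size = 8192;      /* sizeof(PGAlignedBlock) */
        long nbatch_inmemory = 1;    /* starting value chosen for illustration */

        while ((nbatch_inmemory * 2) * block_size * 2 <= (work_mem * 1024L / 2))
            nbatch_inmemory *= 2;

        printf("%ld\n", nbatch_inmemory);   /* prints 128 */
        return 0;
    }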

Maybe the nbatch_inmemory should be renamed to nbatch_slice, not sure.

>   So, I was stepping through the code with work_mem set to the lower
>   bound, and in ExecHashIncreaseNumBatches, I got confused.
>   hashtable->nbatch_inmemory was 2 for me, thus, nbatch_tmp was 2 so,
>   I didn't meet this condition if (nbatch_tmp >
>   hashtable->nbatch_inmemory) since I just set nbatch_tmp using
>   hashtable->nbatch_inmemory So, I didn't increase the number of
>   slices, which is what I was expecting.  What happens when
>   hashtable->nbatch_inmemory is equal to nbatch_tmp?
>

Ah, good catch. The condition you're referring to

    if (nbatch_tmp > hashtable->nbatch_inmemory)

should actually be

    if (nbatch > hashtable->nbatch_inmemory)

because the point is to initialize BufFile structs for the overflow
files, and we need to do that once we cross nbatch_inmemory.

And it turns out this actually causes the assert failures you reported
earlier in the regression tests. It failed to initialize the overflow files
in some cases, so the readOnly flag seemed to be set.

Attached is an updated patch, fixing this. I tried to clarify some of
the comments too, and I fixed another bug I found while running the
regression tests. It's still very much a crappy PoC code, though.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: accounting for memory used for BufFile during hash joins

From: Melanie Plageman
Date:


On Wed, May 8, 2019 at 8:08 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Tue, May 07, 2019 at 05:30:27PM -0700, Melanie Plageman wrote:
>   One thing I was a little confused by was the nbatch_inmemory member
>   of the hashtable.  The comment in ExecChooseHashTableSize says that
>   it is determining the number of batches we can fit in memory.  I
>   thought that the problem was the amount of space taken up by the
>   BufFile data structure itself--which is related to the number of
>   open BufFiles you need at a time. This comment in
>   ExecChooseHashTableSize makes it sound like you are talking about
>   fitting more than one batch of tuples into memory at a time. I was
>   under the impression that you could only fit one batch of tuples in
>   memory at a time.

I suppose you mean this chunk:

    /*
     * See how many batches we can fit into memory (driven mostly by size
     * of BufFile, with PGAlignedBlock being the largest part of that).
     * We need one BufFile for inner and outer side, so we count it twice
     * for each batch, and we stop once we exceed (work_mem/2).
     */
    while ((nbatch_inmemory * 2) * sizeof(PGAlignedBlock) * 2
           <= (work_mem * 1024L / 2))
        nbatch_inmemory *= 2;

Yeah, that comment is a bit confusing. What the code actually does is
computing the largest "slice" of batches for which we can keep the
BufFile structs in memory, without exceeding work_mem/2.

Maybe the nbatch_inmemory should be renamed to nbatch_slice, not sure.

I definitely would prefer to see hashtable->nbatch_inmemory renamed to
hashtable->nbatch_slice--or maybe hashtable->nbuff_inmemory?

I've been poking around the code for a while today, and, even though I
know that the nbatch_inmemory is referring to the buffiles that can
fit in memory, I keep forgetting and thinking it is referring to the
tuple data that can fit in memory.

It might be worth explicitly calling out somewhere in the comments
that overflow slices will only be created either when the number of
batches was underestimated and ExecHashIncreaseNumBatches pushes the
new number of batches past hashtable->nbatch_inmemory, or when the
hashtable is created initially and the number of batches already
exceeds hashtable->nbatch_inmemory -- that is, the number of actual
buffiles that can be managed in memory (the name confuses this for me,
especially at hashtable creation time).
 

Attached is an updated patch, fixing this. I tried to clarify some of
the comments too, and I fixed another bug I found while running the
regression tests. It's still very much a crappy PoC code, though.


So, I ran the following example on master and with your patch.

drop table foo;
drop table bar;
create table foo(a int, b int);
create table bar(c int, d int);
insert into foo select i, i from generate_series(1,10000)i;
insert into bar select 1, 1 from generate_series(1,1000)i;
insert into bar select i%3, i%3 from generate_series(1000,10000)i;
insert into foo select 1,1 from generate_series(1,1000)i;
analyze foo; analyze bar;
set work_mem=64;

On master, explain analyze looked like this

postgres=# explain analyze verbose select * from foo, bar where a = c;
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=339.50..53256.27 rows=4011001 width=16) (actual time=28.962..1048.442 rows=4008001 loops=1)
   Output: foo.a, foo.b, bar.c, bar.d
   Hash Cond: (bar.c = foo.a)
   ->  Seq Scan on public.bar  (cost=0.00..145.01 rows=10001 width=8) (actual time=0.030..1.777 rows=10001 loops=1)
         Output: bar.c, bar.d
   ->  Hash  (cost=159.00..159.00 rows=11000 width=8) (actual time=12.285..12.285 rows=11000 loops=1)
         Output: foo.a, foo.b
         Buckets: 2048 (originally 2048)  Batches: 64 (originally 16)  Memory Usage: 49kB
         ->  Seq Scan on public.foo  (cost=0.00..159.00 rows=11000 width=8) (actual time=0.023..3.786 rows=11000 loops=1)
               Output: foo.a, foo.b
 Planning Time: 0.435 ms
 Execution Time: 1206.904 ms
(12 rows)

and with your patch, it looked like this.

postgres=# explain analyze verbose select * from foo, bar where a = c;
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=339.50..53256.27 rows=4011001 width=16) (actual time=28.256..1102.026 rows=4008001 loops=1)
   Output: foo.a, foo.b, bar.c, bar.d
   Hash Cond: (bar.c = foo.a)
   ->  Seq Scan on public.bar  (cost=0.00..145.01 rows=10001 width=8) (actual time=0.040..1.717 rows=10001 loops=1)
         Output: bar.c, bar.d
   ->  Hash  (cost=159.00..159.00 rows=11000 width=8) (actual time=12.327..12.327 rows=11000 loops=1)
         Output: foo.a, foo.b
         Buckets: 2048 (originally 2048)  Batches: 16384 (originally 16, in-memory 2)  Memory Usage: 131160kB
         ->  Seq Scan on public.foo  (cost=0.00..159.00 rows=11000 width=8) (actual time=0.029..3.569 rows=11000 loops=1)
               Output: foo.a, foo.b
 Planning Time: 0.260 ms
 Execution Time: 1264.995 ms
(12 rows)

I noticed that the number of batches is much higher with the patch,
and, checking $PGDATA/base/pgsql_tmp, I saw that the number of
temp files (the overflow files) present at any given time was quite high.

I would imagine that the desired behaviour is to keep memory usage
within work_mem.
In this example, the number of slices is about 8000, each of which
would have an overflow file. Is this the case you mention in the
comment in ExecChooseHashTableSize?

* We ignore (per-slice)
* overflow files, because those serve as "damage control" for cases
* when per-batch BufFiles would exceed work_mem. Given enough batches
* it's impossible to enforce work_mem strictly, because the overflow
* files alone will consume more memory.

--
Melanie Plageman

Re: accounting for memory used for BufFile during hash joins

From: Tomas Vondra
Date:
On Tue, May 21, 2019 at 05:38:50PM -0700, Melanie Plageman wrote:
>On Wed, May 8, 2019 at 8:08 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
>wrote:
>
>> On Tue, May 07, 2019 at 05:30:27PM -0700, Melanie Plageman wrote:
>> >   One thing I was a little confused by was the nbatch_inmemory member
>> >   of the hashtable.  The comment in ExecChooseHashTableSize says that
>> >   it is determining the number of batches we can fit in memory.  I
>> >   thought that the problem was the amount of space taken up by the
>> >   BufFile data structure itself--which is related to the number of
>> >   open BufFiles you need at a time. This comment in
>> >   ExecChooseHashTableSize makes it sound like you are talking about
>> >   fitting more than one batch of tuples into memory at a time. I was
>> >   under the impression that you could only fit one batch of tuples in
>> >   memory at a time.
>>
>> I suppose you mean this chunk:
>>
>>     /*
>>      * See how many batches we can fit into memory (driven mostly by size
>>      * of BufFile, with PGAlignedBlock being the largest part of that).
>>      * We need one BufFile for inner and outer side, so we count it twice
>>      * for each batch, and we stop once we exceed (work_mem/2).
>>      */
>>     while ((nbatch_inmemory * 2) * sizeof(PGAlignedBlock) * 2
>>            <= (work_mem * 1024L / 2))
>>         nbatch_inmemory *= 2;
>>
>> Yeah, that comment is a bit confusing. What the code actually does is
>> computing the largest "slice" of batches for which we can keep the
>> BufFile structs in memory, without exceeding work_mem/2.
>>
>> Maybe the nbatch_inmemory should be renamed to nbatch_slice, not sure.
>>
>
>I definitely would prefer to see hashtable->nbatch_inmemory renamed to
>hashtable->nbatch_slice--or maybe hashtable->nbuff_inmemory?
>
>I've been poking around the code for awhile today, and, even though I
>know that the nbatch_inmemory is referring to the buffiles that can
>fit in memory, I keep forgetting and thinking it is referring to the
>tuple data that can fit in memory.
>

That's a fair point. I think nbatch_slice is a good name.

>It might be worth explicitly calling out somewhere in the comments
>that overflow slices will only be created either when the number of
>batches was underestimated as part of ExecHashIncreaseNumBatches and
>the new number of batches exceeds the value for
>hashtable->nbatch_inmemory or when creating the hashtable initially
>and the number of batches exceeds the value for
>hashtable->nbatch_inmemory (the name confuses this for me at hashtable
>creation time especially) -- the number of actual buffiles that can be
>managed in memory.
>

Yes, this definitely needs to be explained somewhere - possibly in a
comment at the beginning of nodeHash.c or something like that.

FWIW I wonder if this "slicing" would be useful even with correct
estimates. E.g. let's say we can fit 128 batches into work_mem, but we
expect to need 256 (and it's accurate). At that point it's probably too
aggressive to disable hash joins - a merge join is likely more expensive
than just using the slicing. But that should be a cost-based decision.

>
>>
>> Attached is an updated patch, fixing this. I tried to clarify some of
>> the comments too, and I fixed another bug I found while running the
>> regression tests. It's still very much a crappy PoC code, though.
>>
>>
>So, I ran the following example on master and with your patch.
>
>drop table foo;
>drop table bar;
>create table foo(a int, b int);
>create table bar(c int, d int);
>insert into foo select i, i from generate_series(1,10000)i;
>insert into bar select 1, 1 from generate_series(1,1000)i;
>insert into bar select i%3, i%3 from generate_series(1000,10000)i;
>insert into foo select 1,1 from generate_series(1,1000)i;
>analyze foo; analyze bar;
>set work_mem=64;
>
>On master, explain analyze looked like this
>
>postgres=# explain analyze verbose select * from foo, bar where a = c;
>                                                        QUERY PLAN
>--------------------------------------------------------------------------------------------------------------------------
> Hash Join  (cost=339.50..53256.27 rows=4011001 width=16) (actual time=28.962..1048.442 rows=4008001 loops=1)
>   Output: foo.a, foo.b, bar.c, bar.d
>   Hash Cond: (bar.c = foo.a)
>   ->  Seq Scan on public.bar  (cost=0.00..145.01 rows=10001 width=8) (actual time=0.030..1.777 rows=10001 loops=1)
>         Output: bar.c, bar.d
>   ->  Hash  (cost=159.00..159.00 rows=11000 width=8) (actual time=12.285..12.285 rows=11000 loops=1)
>         Output: foo.a, foo.b
>         Buckets: 2048 (originally 2048)  Batches: 64 (originally 16)  Memory Usage: 49kB
>         ->  Seq Scan on public.foo  (cost=0.00..159.00 rows=11000 width=8) (actual time=0.023..3.786 rows=11000 loops=1)
>               Output: foo.a, foo.b
> Planning Time: 0.435 ms
> Execution Time: 1206.904 ms
>(12 rows)
>
>and with your patch, it looked like this.
>
>postgres=# explain analyze verbose select * from foo, bar where a = c;
>                                                        QUERY PLAN
>--------------------------------------------------------------------------------------------------------------------------
> Hash Join  (cost=339.50..53256.27 rows=4011001 width=16) (actual time=28.256..1102.026 rows=4008001 loops=1)
>   Output: foo.a, foo.b, bar.c, bar.d
>   Hash Cond: (bar.c = foo.a)
>   ->  Seq Scan on public.bar  (cost=0.00..145.01 rows=10001 width=8) (actual time=0.040..1.717 rows=10001 loops=1)
>         Output: bar.c, bar.d
>   ->  Hash  (cost=159.00..159.00 rows=11000 width=8) (actual time=12.327..12.327 rows=11000 loops=1)
>         Output: foo.a, foo.b
>         Buckets: 2048 (originally 2048)  Batches: 16384 (originally 16, in-memory 2)  Memory Usage: 131160kB
>         ->  Seq Scan on public.foo  (cost=0.00..159.00 rows=11000 width=8) (actual time=0.029..3.569 rows=11000 loops=1)
>               Output: foo.a, foo.b
> Planning Time: 0.260 ms
> Execution Time: 1264.995 ms
>(12 rows)
>
>I noticed that the number of batches is much higher with the patch,
>and, I was checking $PGDATA/base/pgsql_tmp and saw that the number of
>temp files which are the overflow files any given time was quite high.
>
>I would imagine that the desired behaviour is to keep memory usage
>within work_mem.

There's definitely something fishy going on. I suspect it's either because
of the duplicate values (which might fit into 64kB on master, but not when
accounting for BufFile), or because the initial 16 batches can't possibly
fit into work_mem.

If you try with a larger work_mem, say 256kB, does that behave OK?

>In this example, the number of slices is about 8000, each of which
>would have an overflow file. Is this the case you mention in the
>comment in ExecChooseHashTableSize ?
>
>* We ignore (per-slice)
>* overflow files, because those serve as "damage control" for cases
>* when per-batch BufFiles would exceed work_mem. Given enough batches
>* it's impossible to enforce work_mem strictly, because the overflow
>* files alone will consume more memory.
>

Yes. 8000 slices is ~64MB of overflow files, and since we need them on both
sides of the join that's ~128MB - which is pretty much exactly the 131160kB
reported above.
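
For reference, a rough back-of-the-envelope of that arithmetic (a standalone
sketch, assuming ~8kB per BufFile and one overflow file per slice on each
side of the join; not PostgreSQL code):

```c
#include <stdio.h>

int main(void)
{
    long buffile_bytes = 8192;              /* ~8kB per BufFile (PGAlignedBuffer) */
    long nslices = 8000;                    /* overflow files per side in this example */

    long per_side = nslices * buffile_bytes;    /* ~64MB */
    long both_sides = 2 * per_side;             /* ~128MB */

    printf("per side: %ld kB, both sides: %ld kB\n",
           per_side / 1024, both_sides / 1024);
    return 0;
}
```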


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: accounting for memory used for BufFile during hash joins

От
Hubert Zhang
Дата:
On Sat, May 4, 2019 at 8:34 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
The root cause is that hash join treats batches as pretty much free, but
that's not really true - we do allocate two BufFile structs per batch,
and each BufFile is ~8kB as it includes PGAlignedBuffer.

The OOM is not very surprising, because with 524288 batches it'd need
about 8GB of memory, and the system only has 8GB RAM installed.

The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
These extra slices can't be counted into work_mem, but we should need
just very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x less overflow files (compared to per-batch).


Hi Tomas

I read your second patch, which uses overflow BufFiles to limit the number of per-batch files kept in memory.
It would solve the hash join OOM problem you discussed above: ~8kB per batch leads to the batch explosion problem.

I mentioned in another thread:
There is another hashjoin OOM problem, caused by disabling the splitting of batches too early. PG uses a flag hashtable->growEnable to determine whether to split batches. Once one split fails (all the tuples end up in just one of the two child batches), the growEnable flag is turned off forever.

This is the opposite of the batch explosion problem: there are too few batches, and the in-memory hash table becomes too large to fit into memory.

Here is the tradeoff: each batch takes a bit more than 8kB (8kB makes sense, for performance reasons), the in-memory hash table takes memory as well, and splitting a batch may (but need not) reduce the in-memory hash table size while introducing more batches (and thus more memory usage, 8kB per batch).
Can we conclude that splitting is only worth it if:
(the reduction of the in-memory hash table size) - (8kB * number of new batches) > 0
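
A minimal sketch of that check (a hypothetical helper, not an existing
PostgreSQL function; the 8kB constant and the caller-supplied estimate of
freed hash table memory are assumptions):

```c
#define BATCH_BUFFILE_BYTES 8192    /* ~8kB of BufFile overhead per new batch */

/*
 * Hypothetical check: doubling nbatch is only worth it if the memory we
 * expect to free from the in-memory hash table exceeds the BufFile
 * overhead added by the new batches.
 */
static bool
splitting_is_worth_it(long freed_hashtable_bytes, int nbatch)
{
    long new_batches = nbatch;      /* doubling nbatch adds nbatch new batches */

    return freed_hashtable_bytes - new_batches * BATCH_BUFFILE_BYTES > 0;
}
```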

So I'm considering combining our patch with your patch to fix the hash join OOM problem, no matter whether the OOM is caused by the memory usage of the in-memory hash table or by the 8kB per batch overhead.

nbatch_inmemory in your patch could also be derived using the rule above.

What's your opinion?

Thanks

Hubert Zhang

Re: accounting for memory used for BufFile during hash joins

От
Hubert Zhang
Дата:
Hi Tomas,

Here is the patch. It should be compatible with your patch, and it focuses on when to re-grow the number of batches.


On Tue, May 28, 2019 at 3:40 PM Hubert Zhang <hzhang@pivotal.io> wrote:
On Sat, May 4, 2019 at 8:34 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
The root cause is that hash join treats batches as pretty much free, but
that's not really true - we do allocate two BufFile structs per batch,
and each BufFile is ~8kB as it includes PGAlignedBuffer.

The OOM is not very surprising, because with 524288 batches it'd need
about 8GB of memory, and the system only has 8GB RAM installed.

The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
These extra slices can't be counted into work_mem, but we should need
just very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x less overflow files (compared to per-batch).


Hi Tomas

I read your second patch which uses overflow buf files to reduce the total number of batches.
It would solve the hash join OOM problem what you discussed above: 8K per batch leads to batch bloating problem.

I mentioned in another thread:
There is another hashjoin OOM problem which disables splitting batches too early. PG uses a flag hashtable->growEnable to determine whether to split batches. Once one splitting failed(all the tuples are assigned to only one batch of two split ones) The growEnable flag would be turned off forever.

The is an opposite side of batch bloating problem. It only contains too few batches and makes the in-memory hash table too large to fit into memory.

Here is the tradeoff: one batch takes more than 8KB(8KB makes sense, due to performance), in-memory hash table takes memory as well and splitting batched may(not must) reduce the in-memory hash table size but introduce more batches(and thus more memory usage 8KB*#batch).
Can we conclude that it would be worth to splitting if satisfy:
(The reduced memory of in-memory hash table) - (8KB * number of new batches) > 0

So I'm considering to combine our patch with your patch to fix join OOM problem. No matter the OOM is introduced by (the memory usage of in-memory hash table) or (8KB * number of batches).

nbatch_inmemory in your patch could also use the upper rule to redefine.

What's your opinion?

Thanks

Hubert Zhang


--
Thanks

Hubert Zhang
Вложения

Re: accounting for memory used for BufFile during hash joins

От
Tomas Vondra
Дата:
On Tue, May 28, 2019 at 03:40:01PM +0800, Hubert Zhang wrote:
>On Sat, May 4, 2019 at 8:34 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
>wrote:
>
>Hi Tomas
>
>I read your second patch which uses overflow buf files to reduce the total
>number of batches.
>It would solve the hash join OOM problem what you discussed above: 8K per
>batch leads to batch bloating problem.
>
>I mentioned in another thread:
>
>https://www.postgresql.org/message-id/flat/CAB0yrekv%3D6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA%40mail.gmail.com
>There is another hashjoin OOM problem which disables splitting batches too
>early. PG uses a flag hashtable->growEnable to determine whether to split
>batches. Once one splitting failed(all the tuples are assigned to only one
>batch of two split ones) The growEnable flag would be turned off forever.
>
>The is an opposite side of batch bloating problem. It only contains too few
>batches and makes the in-memory hash table too large to fit into memory.
>

Yes. There are definitely multiple separate issues in the hashjoin code,
and the various improvements discussed in this (and other) threads usually
address just a subset of them. We need to figure out how to combine them,
or maybe devise some more generic solution.

So I think we need to take a step back, and figure out how to combine
these improvements - otherwise we might commit a fix for one issue, making
it much harder/impossible to improve the other issues.

The other important question is whether we see these cases as outliers
(and the solutions as last-resort-attempt-to-survive kind of fix) or more
widely applicable optimizations. I've seen some interesting speedups with
the overflow-batches patch, but my feeling is we should really treat it as
a last-resort to survive. 

I had a chat about this with Thomas Munro yesterday. Unfortunately, some
beer was involved but I do vaguely remember he more or less convinced me
the BNL (block nested loop join) might be the right approach here. We
don't have any patch for that yet, though :-(

>Here is the tradeoff: one batch takes more than 8KB(8KB makes sense, due to
>performance), in-memory hash table takes memory as well and splitting
>batched may(not must) reduce the in-memory hash table size but introduce
>more batches(and thus more memory usage 8KB*#batch).
>Can we conclude that it would be worth to splitting if satisfy:
>(The reduced memory of in-memory hash table) - (8KB * number of new
>batches) > 0
>

Something like that, yes.

>So I'm considering to combine our patch with your patch to fix join OOM
>problem. No matter the OOM is introduced by (the memory usage of in-memory
>hash table) or (8KB * number of batches).
>
>nbatch_inmemory in your patch could also use the upper rule to redefine.
>
>What's your opinion?
>

One of the issues with my "overflow batches" patch, pointed out to me by
Thomas yesterday, is that it only works with non-parallel hash join. And
we don't know how to make it work in the parallel mode :-(


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: accounting for memory used for BufFile during hash joins

От
Melanie Plageman
Дата:
Okay, so, while I do have specific, actual code review/commitfest-y
feedback for the patch in this thread registered for this commitfest,
I wanted to defer that for a later email and use this one to cover off
on a few higher level issues.

1) How this patch's approach fits into the wider set of problems with
hybrid hashjoin.

2) Parallel HashJoin implementation of this patch's approach

I think implementing support for parallel hashjoin or explicitly
disabling it would be the bare minimum for this patch, which is why I
made 2 its own item. I've marked it as returned to author for this
reason.

I do think that accounting for BufFile overhead when estimating the
size of the hashtable during ExecChooseHashTableSize() so it can be
used during planning is a worthwhile patch by itself (though I know it
is not even part of this patch).

I'll start with 2 since I have less to say there.

From comments upthread, I take it this would not work with parallel
hashjoin as expected. Is this because each worker operates on batches
independently and now batches are lumped into slices?

Thinking through a parallel-aware implementation, it seems like you
would use slice-based barriers for the build phase but batch-based
barriers for the probe phase to avoid getting out of sync (workers
with outer tuples from one batch should not try and join those with
tuples from another batch, even if in the same slice).

You would, of course, need to add code to make slices work with
SharedTuplestore--caveat here is I still haven't tried to understand
how parallel-aware hashjoin works/uses SharedTuplestore.

Now, addressing 1, how this patch fits into the wider set of problems
with current hybrid hashjoin:

Thomas Munro nicely summarized roughly what I'm about to lay out like
this (upthread) -- he called them "three separate but related
problems":

> A.  Broken escape valve:  sometimes we generate a huge number of
> batches while trying to split up many duplicates, because of the
> presence of other more uniformly distributed keys.  We could fix that
> with (say) a 95% rule.
> B.  Lack of good alternative execution strategy when the escape valve
> is triggered.  A batch cannot be split effectively, but cannot fit in
> work_mem, so for now we decide to ignore work_mem.
> C.  Unmetered explosion of batches and thus BufFiles, probably usually
> caused by problem A, but theoretically also due to a real need for
> partitions.

However, I would like to lay out the problem space a little bit
differently. (using the end of the alphabet to differentiate).

The following scenarios are how you could end up running out of
memory:

Y. Plan-time underestimation of the number of required batches with
relatively uniform data distribution

In this case, the best join execution strategy is a plain hashjoin
with spilling as needed.
nbatches should be increased as needed, because the data is ~evenly
distributed.
slicing should be employed when BufFile overhead exceeds some
threshold on the fraction of work_mem that may be used for BufFile overhead

Z. Plan-time and/or execution-time underestimation of the number of
required batches with skewed data

If you knew this at planning time, you could have picked another
join-type, though, there might be cases where it would actually be
less costly to use plain hashjoin for all batches except the bad batch
and fall back to hash block nested loop join just for the duplicate
values.

If you could not have known this at planning time, the best join
execution strategy is a hybrid hashjoin/hash block nested loop join.

To do this, preview whether increasing nbatches would move tuples and, if
it would, increase nbatches (also employing slicing if BufFile overhead
exceeds the threshold)

If increasing nbatches wouldn't move tuples, process this batch with
hash block nested loop join.

Essentially, what we want is logical units of tuples which are
work_mem-sized. In some cases, each unit may contain multiple batches
(a slice in Tomas' patch) and in other cases, each unit may contain
only part of a batch (a chunk is the term I used in my hash block
nested loop join patch [1]).

For slicing, each unit, a slice, has multiple batches but one spill
file.
For hbnlj, each unit, a chunk, is one of multiple chunks in a single
batch, all of which are in the same spill file (1 batch = 1 spill
file).

Thinking through it, it seems to make the most sense to split the work
into ~ 3 independent pieces:

patch1 - "preview" a batch increase (not yet written [I think])
patch2 - slicing (Tomas' patch [2], but adding the threshold for the portion
of work_mem that BufFile overhead may use)
patch3 - hash block nested loop join (my patch [1])

patch1 allows us to re-enable growth and was mentioned upthread, but I
will quote it here for simplicity:

> I think we can fix A by relaxing the escape valve condition, and then
> rechecking it once in a while. So we fill work_mem, realize it didn't
> actually reduce the batch size significantly and disable nbatch growth.
> But at the same time we increase the threshold to 2x work_mem, and after
> reaching it we "consider" a nbatch increase.  That is, we walk the batch
> and see how many tuples would move if we increased nbatch (that should be
> fairly cheap) - if it helps, great, enable growth and split the batch. If
> not, double the threshold again.  Rinse and repeat.
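
A rough sketch of what that "preview" walk could look like (hypothetical
helper and parameter names; it only assumes we can iterate the stored hash
values of the current batch):

```c
#include <stdint.h>

/*
 * Hypothetical sketch: count how many tuples of the current batch would
 * move to a different batch if nbatch were doubled, using the usual
 * (hashvalue >> log2_nbuckets) & (nbatch - 1) mapping.
 */
static double
fraction_moved_if_doubled(const uint32_t *hashvalues, int ntuples,
                          int nbatch, int log2_nbuckets, int curbatch)
{
    int moved = 0;

    for (int i = 0; i < ntuples; i++)
    {
        int newbatch = (hashvalues[i] >> log2_nbuckets) & (2 * nbatch - 1);

        if (newbatch != curbatch)
            moved++;
    }

    return (ntuples > 0) ? (double) moved / ntuples : 0.0;
}
```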

We don't want to fill up work_mem with buffile overhead after
increasing nbatches many times just to move a few tuples for one batch
and end up disabling growth thus making it so that later we can't
increase nbatches and repartition for a batch that would nicely
partition (like Hubert's case, I believe [3]).

We want to identify when re-partitioning would help and only do it
then and, for times when it wouldn't help, use a fallback strategy
that still allows progress on the hashjoin, and, for some spiky data,
where we have re-partitioned for the right reasons, but there are
still a lot of batches that are small enough that they could all fit
in memory at once, we want to track them with as little overhead as
possible -- lump them into slices.

We should probably consider deciding to use slices based on some
threshold for the portion of work_mem which is allowed to be occupied
by buffile overhead instead of waiting until the buffile overhead is
literally taking up most of work_mem.

The above summary is to address the concern in this thread about a
holistic solution.

I think the slicing patch is independent of both the hash block nested
loop join patch and the "preview" mode for batch increasing.

If slicing is made to work for parallel-aware hashjoin and the code is
in a committable state (and probably has the threshold I mentioned
above), then I think that this patch should go in.

[1] https://www.postgresql.org/message-id/CAAKRu_ZkkukQgXCK8ADe-PmvcmpZh6G1Uin8pqqovL4x7P30mQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20190508150844.rij36rtuk4lhvztw%40development
[3] https://www.postgresql.org/message-id/CAB0yre%3De8ysPyoUvZqjKYAxc6-VB%3DJKHL-7XKZSxy0FT5vY7BQ%40mail.gmail.com

--
Melanie Plageman

Re: accounting for memory used for BufFile during hash joins

От
Tomas Vondra
Дата:
On Wed, Jul 10, 2019 at 04:51:02PM -0700, Melanie Plageman wrote:
>Okay, so, while I do have specific, actual code review/commitfest-y
>feedback for the patch in this thread registered for this commitfest,
>I wanted to defer that for a later email and use this one to cover off
>on a few higher level issues.
>
>1) How this patch's approach fits into the wider set of problems with
>hybrid hashjoin.
>
>2) Parallel HashJoin implementation of this patch's approach
>
>I think implementing support for parallel hashjoin or explicitly
>disabling it would be the bare minimum for this patch, which is why I
>made 2 its own item. I've marked it as returned to author for this
>reason.
>

OK. I'm a bit confused / unsure what exactly our solution to the various
hashjoin issues is. I have not been paying attention to all the various
threads, but I thought we kinda pivoted to the BNL approach, no? I'm not
against pushing this patch (the slicing one) forward and then maybe add
BNL on top.

>I do think that accounting for Buffile overhead when estimating the
>size of the hashtable during ExecChooseHashTableSize() so it can be
>used during planning is a worthwhile patch by itself (though I know it
>is not even part of this patch).
>

+1 to that

>I'll start with 2 since I have less to say there.
>
>From comments upthread, I take it this would not work with parallel
>hashjoin as expected. Is this because each worker operates on batches
>independently and now batches are lumped into slices?
>
>Thinking through a parallel-aware implementation, it seems like you
>would use slice-based barriers for the build phase but batch-based
>barriers for the probe phase to avoid getting out of sync (workers
>with outer tuples from one batch should not try and join those with
>tuples from another batch, even if in the same slice).
>
>You would, of course, need to add code to make slices work with
>SharedTuplestore--caveat here is I still haven't tried to understand
>how parallel-aware hashjoin works/uses SharedTuplestore.
>

I don't know. I haven't thought about the parallel version very much. I
wonder if Thomas Munro has some thoughts about it ...

>Now, addressing 1, how this patch fits into the wider set of problem's
>with current hybrid hashjoin:
>
>Thomas Munro nicely summarized roughly what I'm about to lay out like
>this (upthread) -- he called them "three separate but related
>problems":
>
>> A.  Broken escape valve:  sometimes we generate a huge number of
>> batches while trying to split up many duplicates, because of the
>> presence of other more uniformly distributed keys.  We could fix that
>> with (say) a 95% rule.
>> B.  Lack of good alternative execution strategy when the escape valve
>> is triggered.  A batch cannot be split effectively, but cannot fit in
>> work_mem, so for now we decide to ignore work_mem.
>> C.  Unmetered explosion of batches and thus BufFiles, probably usually
>> caused by problem A, but theoretically also due to a real need for
>> partitions.
>
>However, I would like to lay out the problem space a little bit
>differently. (using the end of the alphabet to differentiate).
>
>The following scenarios are how you could end up running out of
>memory:
>
>Y. Plan-time underestimation of the number of required batches with
>relatively uniform data distribution
>
>In this case, the best join execution strategy is a plain hashjoin
>with spilling as needed.
>nbatches should be increased as needed, because the data is ~evenly
>distributed.
>slicing should be employed when buffile overhead exceeds some
>threshhold for the ratio of work_mem to be used for buffile overhead
>

OK, makes sense. But at some point we get so many slices the overflow
files alone use more than work_mem. Of course, to hit that the
underestimate needs to be sufficiently serious. My understanding was we'll
roll until that point and then switch to BNL.

>Z. Plan and or execution time underestimation of the number of
>required batches with skewed data
>
>If you knew this at planning time, you could have picked another
>join-type, though, there might be cases where it would actually be
>less costly to use plain hashjoin for all batches except the bad batch
>and fall back to hash block nested loop join just for the duplicate
>values.
>
>If you could not have known this at planning time, the best join
>execution strategy is a hybrid hashjoin/hash block nested loop join.
>
>To do this, preview if increasing nbatches would move tuples, and, if
>it would, do this (also, employing slicing if buffile overhead exceeds
>the threshold)
>
>If increasing nbatches wouldn't move tuples, process this batch with
>hash block nested loop join.
>

OK.

>Essentially, what we want is logical units of tuples which are
>work_mem-sized. In some cases, each unit may contain multiple batches
>(a slice in Tomas' patch) and in other cases, each unit may contain
>only part of a batch (a chunk is the term I used in my hash block
>nested loop join patch [1]).
>

OK, although with slicing the work_mem-sized unit is still one batch. The
slice just ensures the metadata we need to keep in memory does not grow as
O(N) with the number of batches (instead it's O(log(N)) I think).

>For slicing, each unit, a slice, has multiple batches but one spill
>file.
>For hbnlj, each unit, a chunk, is one of multiple chunks in a single
>batch, all of which are in the same spill file (1 batch = 1 spill
>file).
>

Yep.

>Thinking through it, it seems to make the most sense to split the work
>into ~ 3 independent pieces:
>
>patch1 - "preview" a batch increase (not yet written [I think])
>patch2 - slicing (Tomas' patch [2] but add in threshhold for portion of
>work_mem buffile overhead is using)
>patch3 - hash block nested loop join (my patch [1])
>
>patch1 allows us to re-enable growth and was mentioned upthread, but I
>will quote it here for simplicity:
>
>> I think we can fix A by relaxing the escape valve condition, and then
>> rechecking it once in a while. So we fill work_mem, realize it didn't
>> actually reduce the batch size significantly and disable nbatch growth.
>> But at the same time we increase the threshold to 2x work_mem, and after
>> reaching it we "consider" a nbatch increase.  That is, we walk the batch
>> and see how many tuples would move if we increased nbatch (that should be
>> fairly cheap) - if it helps, great, enable growth and split the batch. If
>> not, double the threshold again.  Rinse and repeat.
>
>We don't want to fill up work_mem with buffile overhead after
>increasing nbatches many times just to move a few tuples for one batch
>and end up disabling growth thus making it so that later we can't
>increase nbatches and repartition for a batch that would nicely
>partition (like Hubert's case, I believe [3]).
>

Yes, this seems like a very reasonable plan. Also, I now see it actually
explains what the plan with BNL vs. slicing is.

>We want to identify when re-partitioning would help and only do it
>then and, for times when it wouldn't help, use a fallback strategy
>that still allows progress on the hashjoin, and, for some spiky data,
>where we have re-partitioned for the right reasons, but there are
>still a lot of batches that are small enough that they could all fit
>in memory at once, we want to track them with as little overhead as
>possible -- lump them into slices.
>
>We should probably consider deciding to use slices based on some
>threshold for the portion of work_mem which is allowed to be occupied
>by buffile overhead instead of waiting until the buffile overhead is
>literally taking up most of work_mem.
>

But that heuristic is already there, no? That's the "don't use more than
2*work_mem/3 for batch BufFiles" limit, at which point we start adding slices.

>The above summary is to address the concern in this thread about a
>holistic solution.
>
>I think the slicing patch is independent of both the hash block nested
>loop join patch and the "preview" mode for batch increasing.
>
>If slicing is made to work for parallel-aware hashjoin and the code is
>in a committable state (and probably has the threshold I mentioned
>above), then I think that this patch should go in.
>

Yes, I think this seems like a good plan.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: accounting for memory used for BufFile during hash joins

От
Robert Haas
Дата:
On Mon, May 6, 2019 at 9:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Stepping back a bit, I think there is something fishy about the way we
> detect extreme skew.  Is that a factor in this case?  Right now we
> wait until we have a batch that gets split into child batches
> containing exactly 0% and 100% of the tuples before we give up.
> Previously I had thought of that as merely a waste of time, but
> clearly it's also a waste of unmetered memory.  Oops.
>
> I think our extreme skew detector should go off sooner, because
> otherwise if you have N nicely distributed unique keys and also M
> duplicates of one bad egg key that'll never fit in memory, we keep
> repartitioning until none of the N keys fall into the batch containing
> the key for the M duplicates before we give up!  You can use
> balls-into-bins maths to figure out the number, but I think that means
> we expect to keep splitting until we have N * some_constant batches,
> and that's just silly and liable to create massive numbers of
> partitions proportional to N, even though we're trying to solve a
> problem with M.  In another thread I suggested we should stop when
> (say) 95% of the tuples go to one child batch.  I'm not sure how you
> pick the number.

Another thing that is fishy about this is that we can't split a batch
or a bucket without splitting them all.  Let's say that nbatches *
nbuckets = 16 million. One bucket in one batch contains 90% of the
tuples. Splitting *that* bucket might be a good idea if only 5% of the
tuples end up moving, perhaps even if only 1% end up moving. But, if
you have to double the total number of batches to get that benefit,
it's a lot less compelling, because now you have to rescan the outer
side more times.

I wonder whether we should be dividing things into batches unevenly,
based on the distribution of tuples we've seen so far.  For example,
suppose we've gotten to 1024 buckets and that's all we can fit in
memory. If we decide to go to 2 batches, we'll use the next bit of the
hash key to decide which things go into batch 0 and which things go
into batch 1. But if we know that 50% of the data is in bucket 17, why
are we not making bucket 17 into a batch and everything else into
another batch? Then, when we process the batch that was derived from
bucket-17, we can use 10 completely new bits from the hash key to
slice the data from that bucket as finely as possible.

Now the bucket might be entirely duplicates, in which case no number
of additional bits will help.  However, even in that case it's still a
good idea to make it its own batch, and then use some other algorithm
to process that batch. And if it's *not* entirely duplicates, but
there are say 2 or 3 really common values that unluckily hash to the
same bucket, then being able to use a lot more bits for that portion
of the data gives us the best chance of managing to spread it out into
different buckets.

Similarly, if we split the hash join into four batches, and batch 0
fits in memory but batch 1 does not, we cannot further split batch 1
without splitting batch 2 and batch 3 also.  That's not good either,
because those batches might be small and not need splitting.

I guess what I'm trying to say is that our algorithms for dealing with
mis-estimation seem to be largely oblivious to the problem of skew,
and I don't think the problem is confined to extreme skew. Suppose you
have some data that is only moderately skewed, so that when you build
a hash table with 1024 buckets, 25% of the data is in buckets 0-19,
25% in buckets 20-768, 25% in buckets 769-946, and the last 25% in
buckets 947-1023. If you knew that, then when you discover that the
data is 4x too large to fit in memory, you can divide the data into 4
batches using those bucket number ranges, and get it done in exactly 4
batches. As it is, you'll need to split until every uniform range of
buckets fits in memory: 0-31 is going to be too big a range, so you're
going to go with 0-15, which means you'll have 64 batches instead of
4.
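
Purely as an illustration of the idea (hypothetical code, not a proposal for
the actual data structures), a range-based bucket-to-batch assignment for
the example above might look like:

```c
/* Upper bucket bounds for the four uneven ranges in the example above. */
static const int range_upper[] = {19, 768, 946, 1023};

/* Map a bucket number to a batch based on the observed distribution, not hash bits. */
static int
batch_for_bucket(int bucketno)
{
    for (int b = 0; b < 4; b++)
    {
        if (bucketno <= range_upper[b])
            return b;
    }
    return -1;                  /* unreachable for bucketno in 0..1023 */
}
```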

It seems to me that a good chunk of what's being proposed right now
basically ignores the fact that we're not really responding to the
skew in a very effective way.  Thomas wants to stop splitting all the
buckets when splitting one of the buckets produces only a very small
benefit rather than when it produces no benefit at all, but he's not
asking why we're splitting all of the buckets in the first place.
Tomas wants to slice the array of batches because there are so many of
them, but why are there so many? As he said himself, "it gets to that
many batches because some of the values are very common and we don't
disable the growth earlier."  Realistically, I don't see how there can
be so many batches that we can't even fit the metadata about the
matches into memory unless we're unnecessarily creating a lot of
little tiny batches that we don't really need.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: accounting for memory used for BufFile during hash joins

От
Hubert Zhang
Дата:

On Fri, Jul 12, 2019 at 1:16 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, May 6, 2019 at 9:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Stepping back a bit, I think there is something fishy about the way we
> detect extreme skew.  Is that a factor in this case?  Right now we
> wait until we have a batch that gets split into child batches
> containing exactly 0% and 100% of the tuples before we give up.
> Previously I had thought of that as merely a waste of time, but
> clearly it's also a waste of unmetered memory.  Oops.
>
> I think our extreme skew detector should go off sooner, because
> otherwise if you have N nicely distributed unique keys and also M
> duplicates of one bad egg key that'll never fit in memory, we keep
> repartitioning until none of the N keys fall into the batch containing
> the key for the M duplicates before we give up!  You can use
> balls-into-bins maths to figure out the number, but I think that means
> we expect to keep splitting until we have N * some_constant batches,
> and that's just silly and liable to create massive numbers of
> partitions proportional to N, even though we're trying to solve a
> problem with M.  In another thread I suggested we should stop when
> (say) 95% of the tuples go to one child batch.  I'm not sure how you
> pick the number.

Another thing that is fishy about this is that we can't split a batch
or a bucket without splitting them all.  Let's say that nbatches *
nbuckets = 16 million. One bucket in one batch contains 90% of the
tuples. Splitting *that* bucket might be a good idea if only 5% of the
tuples end up moving, perhaps even if only 1% end up moving. But, if
you have to double the total number of batches to get that benefit,
it's a lot less compelling, because now you have to rescan the outer
side more times.
 
It seems to me that a good chunk of what's being proposed right now
basically ignores the fact that we're not really responding to the
skew in a very effective way.  Thomas wants to stop splitting all the
buckets when splitting one of the buckets produces only a very small
benefit rather than when it produces no benefit at all, but he's not
asking why we're splitting all of the buckets in the first place.
Tomas wants to slice the array of batches because there are so many of
them, but why are there so many? As he said himself, "it gets to that
many batches because some of the values are very common and we don't
disable the growth earlier."  Realistically, I don't see how there can
be so many batches that we can't even fit the metadata about the
matches into memory unless we're unnecessarily creating a lot of
little tiny batches that we don't really need.


+1 on Robert's suggestion. It's worth finding the root cause of the batch explosion problem.
As Robert pointed out, "we can't split a batch ... without splitting them all". In fact, a hybrid hash join should only split the overflowing batch and avoid splitting the small batches that can be processed in memory. The planner should calculate an initial batch number that ensures the average-sized batch can be processed in memory for a given data distribution. The executor should then split skewed batches in a one-batch-at-a-time manner.
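
As a sketch of the planner side of that (hypothetical and simplified; the
real ExecChooseHashTableSize() logic has more inputs, and the names here
are made up):

```c
/*
 * Pick the smallest power-of-two nbatch such that the average batch of the
 * inner relation is expected to fit into the memory budget.  BufFile
 * overhead is deliberately ignored here, as in the current code.
 */
static int
choose_initial_nbatch(double inner_rel_bytes, double hash_mem_bytes)
{
    int nbatch = 1;

    while ((double) nbatch * hash_mem_bytes < inner_rel_bytes)
        nbatch *= 2;

    return nbatch;
}
```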

I will first show an example to help understand the batch explosion problem.
Suppose we are going to join R and S, and the planner calculates the initial nbatch as 4.
In the first batch run, during the HJ_BUILD_HASHTABLE state we scan R, build an in-memory hash table for batch1, and spill the other tuples of R into separate batch files (R2-R4). During the HJ_NEED_NEW_OUTER and HJ_SCAN_BUCKET states, we do two things: 1. if a tuple of S belongs to the current batch, match it against the in-memory R1 and emit the result to the parent plan node; 2. if a tuple of S doesn't belong to the current batch, spill it to one of the batch files S2-S4. As a result, after the first batch run we get:
6 disk files: batch2 (R2,S2), batch3 (R3,S3), batch4 (R4,S4)

Now we run into the HJ_NEED_NEW_BATCH state and begin to process R2 and S2. Suppose the second batch R2 is skewed and needs the batch number to grow to 8. While building the in-memory hash table for R2, we also split some tuples of R2 into spill file R6 (based on our hash function, tuples belonging to R2 cannot be shuffled to any batch other than R6). After R2's hash table is built, we begin to probe the tuples in S2. Since the batch number has changed from 4 to 8, some tuples of S2 now belong to S6 and we spill them to disk file S6. The other tuples of S2 are matched against R2 and the results are emitted to the parent plan node. After the second batch is processed, we have:
disk files: batch3 (R3,S3), batch4 (R4,S4), batch6 (R6,S6)

Next, we begin to process R3 and S3. The third batch R3 is not skewed, but our hash function depends on the batch number, which is now 8, so we have to split some tuples of R3 into disk file R7, which is unnecessary. When probing S3, we also need to split some tuples of S3 into S7, which is unnecessary as well. Since R3 could have been loaded into memory entirely, spilling part of R3 to a disk file not only introduces more files and file buffers (which is the problem Tomas is trying to solve), but also slows down the join. After the third batch is processed, we have:
disk files: batch4 (R4,S4), batch6 (R6,S6), batch7 (R7,S7)

Next, we begin to process R4 and S4. Similar to R3, some tuples of R4 also need to be spilled to file R8. Suppose that even after this split R4 is still skewed, so we increase the batch number again to 16. As a result, some tuples of R4 are spilled to files R12 and R16. When probing S4, we similarly need to split some tuples of S4 into S8, S12 and S16. After this step, we get:
disk files: batch6 (R6,S6), batch7 (R7,S7), batch8 (R8,S8), batch12 (R12,S12), batch16 (R16,S16).

Next, when we begin to process R6 and S6, even though we could build the hash table for R6 entirely in memory, we still have to split R6 based on the new batch number 16 and spill tuples to file R14. That is unnecessary.

We can now conclude that increasing the batch number introduces unnecessary repeated spilling, not only for original batches (R3,S3) but also for newly generated batches (R6,S6), in a cascading way. In a worse case, suppose R2 is extremely skewed and needs to be split 10 times, while R3 fits into an in-memory hash table just fine. In that case we have to introduce R7, R11, ..., R4095 - 1023 unnecessary spill files in total. Each of these files may contain fewer than ten tuples, yet we still need to palloc a file buffer (~8kB per BufFile) for each of them. This is the so-called batch explosion problem.

Solution:
To avoid this unnecessary repeated spilling, I propose to turn ExecHashGetBucketAndBatch into a chain of hash functions that determines the batchno.
Here is the original implementation of ExecHashGetBucketAndBatch
```

/* nbatch is the global number of batches */
*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

```
We can see that the original hash function simply computes the hash value MOD the global batch number.

A real hybrid hash join should use a chain of hash functions to determine the batchno. In the new algorithm, the components of the hash function chain are defined as: MOD by the initial batch number (IBN), MOD by IBN*2, MOD by IBN*4, MOD by IBN*8, etc. A small batch just uses the first hash function in the chain, while a skewed batch uses as many hash functions in the chain as the number of times it has been split.
Here is the new implementation of ExecHashGetBucketAndBatch
```
/* i is the current batchno we are processing */
/* hashChainLen[i] records how many times batch i has been split */
for (j = 0; j < hashChainLen[i]; j++)
{
    batchno = (hashvalue >> hashtable->log2_nbuckets) & ((initialBatch << j) - 1);

    /*
     * If the calculated batchno differs from i, a coarser hash function in
     * the chain already routes the tuple to another batch, so return it;
     * otherwise try the next (finer) hash function in the chain.
     */
    if (batchno != i)
        return batchno;
}
return batchno;
```

A quick example: suppose R3's input is 3,7,11,15,19,23,27,31,35,15,27 (all of them are 3 MOD 4).
Suppose the initial batch number is 4 and memory can hold 4 tuples, so the 5th tuple triggers a batch split.
Step 1: batch3 processes 3,7,11,15,19 and now needs to split:
            chainLen[3]=2
            batch3: 3,11,19
            batch7: 7,15
Step 2: 23,27,31 arrive
           batch3: 3,11,19,27
           batch7: 7,15,23,31
Step 3: 35 arrives, batch3 needs to split again
           chainLen[3]=3
           batch3: 3,19,35
           batch7: 7,15,23,31
           batch11: 11,27
Step 4: 15 arrives, hash function 1 gives 15%4=3, hash function 2 gives 15%8=7;
           since 7 != 3, spill 15 to batch7.
Step 5: 27 arrives, 27%4=3, 27%8=3, 27%16=11;
           since 11 != 3, spill 27 to batch11.
Final state:
           chainLen[3]=3
           batch3:  3,19,35
           batch7:  7,15,23,31,15
           batch11: 11,27,27
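
To double-check the example, here is a small standalone simulation of the
chain-based routing (hypothetical names, not PostgreSQL code; it assumes the
insert-then-split order used above) which reproduces the final state:

```c
#include <stdio.h>

#define INITIAL_NBATCH 4
#define MAX_BATCHES    16
#define MEM_TUPLES     4                /* in-memory capacity of the current batch */

static int chain_len = 1;               /* hashChainLen[3]: 1 + number of splits of batch 3 */

/* Route a value through the chain of moduli 4, 8, 16, ...; first mismatch wins. */
static int
chain_batchno(int value)
{
    for (int j = 0; j < chain_len; j++)
    {
        int batchno = value % (INITIAL_NBATCH << j);

        if (batchno != 3)
            return batchno;
    }
    return 3;
}

int
main(void)
{
    int input[] = {3, 7, 11, 15, 19, 23, 27, 31, 35, 15, 27};
    int inmem[32], n_inmem = 0;
    int spill[MAX_BATCHES][32];
    int n_spill[MAX_BATCHES] = {0};

    for (int i = 0; i < (int) (sizeof(input) / sizeof(input[0])); i++)
    {
        int batchno = chain_batchno(input[i]);

        if (batchno != 3)
        {
            spill[batchno][n_spill[batchno]++] = input[i];
            continue;
        }

        inmem[n_inmem++] = input[i];

        /* Batch 3 overflowed: extend the chain and redistribute its tuples. */
        if (n_inmem > MEM_TUPLES)
        {
            int old[32];
            int n_old = n_inmem;

            for (int k = 0; k < n_old; k++)
                old[k] = inmem[k];

            chain_len++;
            n_inmem = 0;
            for (int k = 0; k < n_old; k++)
            {
                int b = chain_batchno(old[k]);

                if (b == 3)
                    inmem[n_inmem++] = old[k];
                else
                    spill[b][n_spill[b]++] = old[k];
            }
        }
    }

    printf("chainLen[3] = %d, in-memory batch3:", chain_len);
    for (int k = 0; k < n_inmem; k++)
        printf(" %d", inmem[k]);
    printf("\n");

    for (int b = 0; b < MAX_BATCHES; b++)
    {
        if (n_spill[b] == 0)
            continue;
        printf("spill file batch%d:", b);
        for (int k = 0; k < n_spill[b]; k++)
            printf(" %d", spill[b][k]);
        printf("\n");
    }
    return 0;
}
```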

Here is pseudocode for processing batch i:
```
/* Step 1: build the hash table for Ri */
foreach tuple in ReadFromFile(Ri):
    /* get batchno from the new chain-based function */
    batchno = NewExecHashGetBucketAndBatch(hashvalue);
    /* spill the tuple if it no longer belongs to the current batch */
    if (batchno != i)
        SpillToFile(tuple, batchno);
    else
    {
        flag = InsertTupleToHashTable(HT, tuple);
        if (flag == NEED_SPLIT)
        {
            hashChainLen[i]++;
            /* then call ExecHashIncreaseNumBatches() to do the actual split */
        }
    }

/* Step 2: probe stage */
foreach tuple in ReadFromFile(S[i + Bi*k]):
    batchno = NewExecHashGetBucketAndBatch(hashvalue);
    if (batchno == curbatch)
        probe and emit matches;
    else
        SpillToFile(tuple, batchno);
```

This solution only splits the batches that actually need to be split, in a lazy way. If this solution makes sense, I would like to write the real patch.
Any comments?


--
Thanks

Hubert Zhang

Re: accounting for memory used for BufFile during hash joins

От
Alvaro Herrera
Дата:
On 2019-Jul-11, Tomas Vondra wrote:

> On Wed, Jul 10, 2019 at 04:51:02PM -0700, Melanie Plageman wrote:

> > I think implementing support for parallel hashjoin or explicitly
> > disabling it would be the bare minimum for this patch, which is why I
> > made 2 its own item. I've marked it as returned to author for this
> > reason.
> 
> OK. I'm a bit confused / unsure what exactly our solution to the various
> hashjoin issues is. I have not been paying attention to all the various
> threads, but I thought we kinda pivoted to the BNL approach, no? I'm not
> against pushing this patch (the slicing one) forward and then maybe add
> BNL on top.

So what's a good way forward for this patch?  Stalling forever like a
glacier is not an option; it'll probably end up melting.  There's a lot
of discussion on this thread which I haven't read, and it's not
immediately clear to me whether this patch should just be thrown away in
favor of something completely different, or it can be considered a first
step in a long road.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: accounting for memory used for BufFile during hash joins

От
Melanie Plageman
Дата:


On Tue, Sep 3, 2019 at 9:36 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Jul-11, Tomas Vondra wrote:

> On Wed, Jul 10, 2019 at 04:51:02PM -0700, Melanie Plageman wrote:

> > I think implementing support for parallel hashjoin or explicitly
> > disabling it would be the bare minimum for this patch, which is why I
> > made 2 its own item. I've marked it as returned to author for this
> > reason.
>
> OK. I'm a bit confused / unsure what exactly our solution to the various
> hashjoin issues is. I have not been paying attention to all the various
> threads, but I thought we kinda pivoted to the BNL approach, no? I'm not
> against pushing this patch (the slicing one) forward and then maybe add
> BNL on top.

So what's a good way forward for this patch?  Stalling forever like a
glacier is not an option; it'll probably end up melting.  There's a lot
of discussion on this thread which I haven't read, and it's not
immediately clear to me whether this patch should just be thrown away in
favor of something completely different, or it can be considered a first
step in a long road.

So, I have been working on the fallback to block nested loop join
patch--latest non-parallel version posted here [1]. I am currently
still working on the parallel version but don't have a complete
working patch yet. I am hoping to finish it and solicit feedback in
the next couple weeks.

My patch chunks up a bad inner side batch and processes it a chunk
at a time. I haven't spent too much time yet thinking about Hubert's
suggestion proposed upthread. In the past I had asked Tomas about the
idea of splitting up only "bad batches" to avoid having other batches
which are very small. It seemed like this introduced additional
complexity for future spilled tuples finding a home, however, I had
not considered the hash function chain method Hubert is mentioning.

Even if we implemented additional strategies like the one Hubert is
suggesting, I still think that both the slicing patch originally
proposed in this thread as well as a BNLJ fallback option could all
work together, as I believe they solve slightly different problems.

If Tomas or someone else has time to pick up and modify the BufFile
accounting patch, committing that still seems like the next logical
step.

I will work on getting a complete (parallel-aware) BNLJ patch posted
soon.

[1] https://www.postgresql.org/message-id/CAAKRu_ZsRU%2BnszShs3AGVorx%3De%2B2jYkL7X%3DjiNO6%2Bqbho7vRpw%40mail.gmail.com

--
Melanie Plageman

Re: accounting for memory used for BufFile during hash joins

От
Tomas Vondra
Дата:
On Thu, Sep 05, 2019 at 09:54:33AM -0700, Melanie Plageman wrote:
>On Tue, Sep 3, 2019 at 9:36 AM Alvaro Herrera <alvherre@2ndquadrant.com>
>wrote:
>
>> On 2019-Jul-11, Tomas Vondra wrote:
>>
>> > On Wed, Jul 10, 2019 at 04:51:02PM -0700, Melanie Plageman wrote:
>>
>> > > I think implementing support for parallel hashjoin or explicitly
>> > > disabling it would be the bare minimum for this patch, which is why I
>> > > made 2 its own item. I've marked it as returned to author for this
>> > > reason.
>> >
>> > OK. I'm a bit confused / unsure what exactly our solution to the various
>> > hashjoin issues is. I have not been paying attention to all the various
>> > threads, but I thought we kinda pivoted to the BNL approach, no? I'm not
>> > against pushing this patch (the slicing one) forward and then maybe add
>> > BNL on top.
>>
>> So what's a good way forward for this patch?  Stalling forever like a
>> glacier is not an option; it'll probably end up melting.  There's a lot
>> of discussion on this thread which I haven't read, and it's not
>> immediately clear to me whether this patch should just be thrown away in
>> favor of something completely different, or it can be considered a first
>> step in a long road.
>>
>
>So, I have been working on the fallback to block nested loop join
>patch--latest non-parallel version posted here [1]. I am currently
>still working on the parallel version but don't have a complete
>working patch yet. I am hoping to finish it and solicit feedback in
>the next couple weeks.
>
>My patch chunks up a bad inner side batch and processes it a chunk
>at a time. I haven't spent too much time yet thinking about Hubert's
>suggestion proposed upthread. In the past I had asked Tomas about the
>idea of splitting up only "bad batches" to avoid having other batches
>which are very small. It seemed like this introduced additional
>complexity for future spilled tuples finding a home, however, I had
>not considered the hash function chain method Hubert is mentioning.
>
>Even if we implemented additional strategies like the one Hubert is
>suggesting, I still think that both the slicing patch originally
>proposed in this thread as well as a BNLJ fallback option could all
>work together, as I believe they solve slightly different problems.
>

I have to admit I kinda lost track of how exactly all the HJ patches
posted in various -hackers threads shall work together in the end. We have
far too many in-flight patches dealing with this part of the code at the
moment. It's a bit like with the buses - for years there were no patches
fixing those issues, and now we have 17 ;-)

My feeling is that we should get the BNLJ committed first, and then maybe
use some of those additional strategies as fallbacks (depending on which
issues are still unsolved by the BNLJ).

>If Tomas or someone else has time to pick up and modify BufFile
>accounting patch, committing that still seems like the nest logical
>step.
>

OK, I'll look into that (i.e. considering BufFile memory during planning,
and disabling HJ if not possible).

>I will work on getting a complete (parallel-aware) BNLJ patch posted
>soon.
>

Good!


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: accounting for memory used for BufFile during hash joins

От
Michael Paquier
Дата:
On Tue, Sep 10, 2019 at 03:47:51PM +0200, Tomas Vondra wrote:
> My feeling is that we should get the BNLJ committed first, and then maybe
> use some of those additional strategies as fallbacks (depending on which
> issues are still unsolved by the BNLJ).

The glacier is melting more.  Tomas, what's your status here?  The
patch has been waiting on author for two months now.  If you are not
planning to work more on this one, then it should be marked as
returned with feedback?
--
Michael

Вложения

Re: accounting for memory used for BufFile during hash joins

От
Tomas Vondra
Дата:
On Mon, Nov 25, 2019 at 05:33:35PM +0900, Michael Paquier wrote:
>On Tue, Sep 10, 2019 at 03:47:51PM +0200, Tomas Vondra wrote:
>> My feeling is that we should get the BNLJ committed first, and then maybe
>> use some of those additional strategies as fallbacks (depending on which
>> issues are still unsolved by the BNLJ).
>
>The glacier is melting more.  Tomas, what's your status here?  The
>patch has been waiting on author for two months now.  If you are not
>planning to work more on this one, then it should be marked as
>returned with feedback?

I'm not planning to do any immediate work on this, so I agree with
marking it as RWF. I think Melanie is working on the BNL patch, which
seems like the right solution.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: accounting for memory used for BufFile during hash joins

От
Michael Paquier
Дата:
On Mon, Nov 25, 2019 at 07:11:19PM +0100, Tomas Vondra wrote:
> I'm not planning to do any any immediate work on this, so I agree with
> marking it as RWF. I think Melanie is working on the BNL patch, which
> seems like the right solution.

Thanks, I have switched the patch as returned with feedback.
--
Michael

Вложения

Re: accounting for memory used for BufFile during hash joins

От
Melanie Plageman
Дата:

On Mon, Nov 25, 2019 at 10:11 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Mon, Nov 25, 2019 at 05:33:35PM +0900, Michael Paquier wrote:
>On Tue, Sep 10, 2019 at 03:47:51PM +0200, Tomas Vondra wrote:
>> My feeling is that we should get the BNLJ committed first, and then maybe
>> use some of those additional strategies as fallbacks (depending on which
>> issues are still unsolved by the BNLJ).
>
>The glacier is melting more.  Tomas, what's your status here?  The
>patch has been waiting on author for two months now.  If you are not
>planning to work more on this one, then it should be marked as
>returned with feedback?

I'm not planning to do any any immediate work on this, so I agree with
marking it as RWF. I think Melanie is working on the BNL patch, which
seems like the right solution.


Sorry for the delay. I have posted the parallel-aware BNLJ (adaptive HJ)
version of this in the thread which originally had all of the
patches for it [1]. It's not near committable, so I wasn't going to
register it for a commitfest yet, but I would love feedback on the
prototype.

--
Melanie Plageman