Discussion: Fix overflow of nbatch


Fix overflow of nbatch

From: Vaibhav Jain
Date:
Hi Everyone,

With a1b4f28, to compute current_space, nbatch is being multiplied
by BLCKSZ. nbatch is int and when multiplied with BLCKSZ, it can
easily overflow the int limit. To keep the calculation safe for
current_space, convert nbatch to size_t.

Please find a patch for the same.

Thanks,
Vaibhav
Attachments

Re: Fix overflow of nbatch

From: David Rowley
Date:
On Tue, 23 Sept 2025 at 01:57, Vaibhav Jain <jainva@google.com> wrote:
> With a1b4f28, to compute current_space, nbatch is being multiplied
> by BLCKSZ. nbatch is int and when multiplied with BLCKSZ, it can
> easily overflow the int limit. To keep the calculation safe for
> current_space, convert nbatch to size_t.

Thanks for finding and reporting this.

I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
in the following two lines because the final result is a size_t:

size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);

I'd rather see this fixed by adding a cast, i.e. "(size_t) nbatch", on
the above two lines. All that code is new in a1b4f289b, and if you
change nbatch to a size_t it affects the code that existed before
a1b4f289b. From looking over that code, it seems to ensure that
nbatch does not overflow int by doing "dbatch = Min(dbatch,
max_pointers);", where max_pointers is constrained by MaxAllocSize /
sizeof(HashJoinTuple).
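
For illustration, a minimal standalone sketch of the arithmetic (not the
nodeHash.c code; the nbatch and hash_table_bytes values are made up, and
BLCKSZ is assumed to be 8192):

```
#include <stdio.h>
#include <stddef.h>

int
main(void)
{
    int     nbatch = 262144;                    /* large but plausible batch count */
    size_t  hash_table_bytes = 4UL * 1024 * 1024;   /* 4MB, made up */

    /* 2 * nbatch * 8192 would be evaluated in int and overflow (undefined
     * behavior), e.g.:
     *   size_t buggy = hash_table_bytes + (2 * nbatch * 8192);
     * Casting nbatch promotes the whole product to size_t arithmetic. */
    size_t  current_space = hash_table_bytes + (2 * (size_t) nbatch * 8192);

    printf("current_space = %zu\n", current_space);  /* 4299161600 */
    return 0;
}
```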

Also, making nbatch a size_t makes the final line of "*numbatches =
nbatch;" somewhat questionable, since numbatches is an int pointer.

David



Re: Fix overflow of nbatch

From: Tomas Vondra
Date:

On 9/22/25 22:45, David Rowley wrote:
> On Tue, 23 Sept 2025 at 01:57, Vaibhav Jain <jainva@google.com> wrote:
>> With a1b4f28, to compute current_space, nbatch is being multiplied
>> by BLCKSZ. nbatch is int and when multiplied with BLCKSZ, it can
> easily overflow the int limit. To keep the calculation safe for
>> current_space, convert nbatch to size_t.
> 
> Thanks for finding and reporting this.
> 
> I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
> in the following two lines because the final result is a size_t:
> 
> size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
> size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
> 

Yeah, I failed to notice this part of the formula can overflow.

> I'd rather see this fixed by adding a cast, i.e.: "(size_t) nbatch" on
> the above two lines. All that code is new in a1b4f289b, and if you
> change the nbatch to a size_t it affects the code that existed before
> a1b4f289b, and from looking over that code, it seems to ensure that
> nbatch does not overflow int by doing "dbatch = Min(dbatch,
> max_pointers);", where max_pointers is constrained by MaxAllocSize /
> sizeof(HashJoinTuple).
> 
> Also, making nbatches size_t makes the final line of " *numbatches =
> nbatch;" somewhat questionable since numbatches is an int pointer.
> 

It seems cleaner to add the cast to the formula, if only because of
the assignment to numbatches.


thanks

-- 
Tomas Vondra




Re: Fix overflow of nbatch

From: Chao Li
Date:


On Sep 22, 2025, at 21:20, Vaibhav Jain <jainva@google.com> wrote:

Hi Everyone,

With a1b4f28, to compute current_space, nbatch is being multiplied
by BLCKSZ. nbatch is int and when multiplied with BLCKSZ, it can
easily overflow the int limit. To keep the calculation safe for
current_space, convert nbatch to size_t.

Please find a patch for the same.

Thanks,
Vaibhav
<0001-Fix-overflow-of-nbatch.patch>

I guess that's because earlier in the function, nbatch is always clamped with:

nbatch = pg_nextpower2_32(Max(2, minbatch));

So, in practice, nbatch won’t grow very large. But yes, if nbatch reaches, say, 1 million, it will overflow.

A simple program shows that changing nbatch to size_t prevents the overflow:

```
#include <stdio.h>

int main(void)
{
    size_t nbatch = 1000000;    // 1 million
    int BLCKSZ = 8192;
    size_t result = 2 * nbatch * BLCKSZ;

    printf("%zu\n", result);    // will output 16384000000
    return 0;
}
```
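
For contrast, a hypothetical variant (not from the patch or the thread) that
keeps nbatch as an int: the product is then evaluated entirely in int
arithmetic and overflows, which is undefined behavior for signed integers,
so the intended value is lost before the assignment ever widens it:

```
#include <stdio.h>

int main(void)
{
    int nbatch = 1000000;       // 1 million, as above
    int BLCKSZ = 8192;

    // all operands are int, so the multiplication overflows before the
    // result is widened for the assignment (undefined behavior)
    size_t result = 2 * nbatch * BLCKSZ;

    printf("%zu\n", result);    // not the intended 16384000000
    return 0;
}
```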

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Re: Fix overflow of nbatch

From: David Rowley
Date:
On Tue, 23 Sept 2025 at 11:21, Chao Li <li.evan.chao@gmail.com> wrote:
> I guess that because earlier in the function, nbatch is always clamped with:
>
> nbatch = pg_nextpower2_32(Max(2, minbatch));

I don't follow which part of that line could be construed as
clamping. Maybe you've confused Max with Min?

David



Re: Fix overflow of nbatch

From: David Rowley
Date:
On Tue, 23 Sept 2025 at 09:31, Tomas Vondra <tomas@vondra.me> wrote:
> On 9/22/25 22:45, David Rowley wrote:
> > I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
> > in the following two lines because the final result is a size_t:
> >
> > size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
> > size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
> >
>
> Yeah, I failed to notice this part of the formula can overflow.

Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
over, should I take care of this, or do you want to?

David



Re: Fix overflow of nbatch

From: Tomas Vondra
Date:
On 9/23/25 02:02, David Rowley wrote:
> On Tue, 23 Sept 2025 at 09:31, Tomas Vondra <tomas@vondra.me> wrote:
>> On 9/22/25 22:45, David Rowley wrote:
>>> I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
>>> in the following two lines because the final result is a size_t:
>>>
>>> size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
>>> size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
>>>
>>
>> Yeah, I failed to notice this part of the formula can overflow.
> 
> Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
> over, should I take care of this, or do you want to?
> 

Feel free to fix, but I can take care of it once 18 is out the door.
It's my bug, after all.

BTW ExecHashIncreaseBatchSize needs the same fix, I think.

I wonder how likely the overflow is. AFAICS we'd need nbatch=256k (with
8KB blocks), which is a lot. But with the balancing logic, it'd also
mean each batch is about ~2GB. So the whole "hash table" would be about
500GB. Possible, but unlikely.

However, looking at the code now, I think the code should adjust the
hash_table_bytes value, not just space_allowed. It's meant to be the
same thing. Will check tomorrow.


thanks

-- 
Tomas Vondra




Re: Fix overflow of nbatch

From: David Rowley
Date:
On Tue, 23 Sept 2025 at 13:01, Tomas Vondra <tomas@vondra.me> wrote:
>
> On 9/23/25 02:02, David Rowley wrote:
> > Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
> > over, should I take care of this, or do you want to?
> >
>
> Feel free to fix, but I can take care of it once 18 is out the door.
> It's my bug, after all.
>
> BTW ExecHashIncreaseBatchSize needs the same fix, I think.

I think it's probably best you handle this. I didn't notice that one.
You know this area much better than I do.

> I wonder how likely the overflow is. AFAICS we'd need nbatch=256k (with
> 8KB blocks), which is a lot. But with the balancing logic, it'd also
> mean each batch is about ~2GB. So the whole "hash table" would be about
> 500GB. Possible, but unlikely.

I think no matter how low the chances of overflow are, the code isn't
written the way it was intended to be, so it should just be put the
way it was intended to be without question of the likelihood of
overflow. Otherwise, we'll just end up with harder to hit bugs which
could take much longer to [re]discover. Also, in these terms, what's
unlikely today may not be in the future.

David



Re: Fix overflow of nbatch

From: Chao Li
Date:


On Sep 23, 2025, at 07:35, David Rowley <dgrowleyml@gmail.com> wrote:

On Tue, 23 Sept 2025 at 11:21, Chao Li <li.evan.chao@gmail.com> wrote:
I guess that because earlier in the function, nbatch is always clamped with:

nbatch = pg_nextpower2_32(Max(2, minbatch));

I don't follow which part of that line could be constituted as
clamping. Maybe you've confused Max with Min?

David

Sorry for the confusion. I actually meant “minbatch”.

I have traced through the function several times. First, with a normal query (not much data involved), the condition

if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
{

is hard to meet, so nbatch will be just 1.

With big data involved, it will enter the “if” clause, but minbatch is still unlikely to get very high.

To clarify, I just created a test:

```
evantest=# SET enable_nestloop = off;
SET
evantest=# SET enable_mergejoin = off;
SET
evantest=# SET enable_hashjoin = on;
SET
evantest=# CREATE TEMP TABLE inner_tbl AS
evantest-# SELECT g AS id, repeat('x', 2000) AS filler
evantest-# FROM generate_series(1, 200000) g;
SELECT 200000
evantest=# CREATE TEMP TABLE outer_tbl AS
evantest-# SELECT g AS id FROM generate_series(1, 1000000) g;
SELECT 1000000
evantest=#
evantest=#
evantest=# EXPLAIN ANALYZE
evantest-# SELECT *
evantest-# FROM outer_tbl o
evantest-# JOIN inner_tbl i
evantest-#   ON o.id = i.id;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=34647.00..1270978635.52 rows=36306020352 width=40) (actual time=353.908..1355.735 rows=200000.00 loops=1)
   Hash Cond: (i.id = o.id)
   Buffers: local read=54528 dirtied=54425 written=54418, temp read=45853 written=45853
   ->  Seq Scan on inner_tbl i  (cost=0.00..113608.96 rows=6356096 width=36) (actual time=1.132..460.711 rows=200000.00 loops=1)
         Buffers: local read=50048 dirtied=50000 written=49993
   ->  Hash  (cost=15904.00..15904.00 rows=1142400 width=4) (actual time=351.280..351.282 rows=1000000.00 loops=1)
         Buckets: 262144  Batches: 8  Memory Usage: 6446kB
         Buffers: local read=4480 dirtied=4425 written=4425, temp written=2560
         ->  Seq Scan on outer_tbl o  (cost=0.00..15904.00 rows=1142400 width=4) (actual time=0.760..162.229 rows=1000000.00 loops=1)
               Buffers: local read=4480 dirtied=4425 written=4425
 Planning:
   Buffers: shared hit=14
 Planning Time: 389649.420 ms
 Execution Time: 1362.392 ms
(14 rows)
```

In this test, minbatch is just 64.

But I agree, I have never tested with a large amount of data. I don’t actually know how much data it takes to make nbatch reach ~130K (the value that will lead to overflow if nbatch is of int type).

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Re: Fix overflow of nbatch

From: Vaibhav Jain
Date:
Thank you everyone for the reviews, feedback and the test.

Please find attached a new version of the patch.


Attachments

Re: Fix overflow of nbatch

From: Tomas Vondra
Date:

On 9/23/25 03:20, David Rowley wrote:
> On Tue, 23 Sept 2025 at 13:01, Tomas Vondra <tomas@vondra.me> wrote:
>>
>> On 9/23/25 02:02, David Rowley wrote:
>>> Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
>>> over, should I take care of this, or do you want to?
>>>
>>
>> Feel free to fix, but I can take care of it once 18 is out the door.
>> It's my bug, after all.
>>
>> BTW ExecHashIncreaseBatchSize needs the same fix, I think.
> 
> I think it's probably best you handle this. I didn't notice that one.
> You know this area much better than I do.
> 

OK, will do.

>> I wonder how likely the overflow is. AFAICS we'd need nbatch=256k (with
>> 8KB blocks), which is a lot. But with the balancing logic, it'd also
>> mean each batch is about ~2GB. So the whole "hash table" would be about
>> 500GB. Possible, but unlikely.
> 
> I think no matter how low the chances of overflow are, the code isn't
> written the way it was intended to be, so it should just be put the
> way it was intended to be without question of the likelihood of
> overflow. Otherwise, we'll just end up with harder to hit bugs which
> could take much longer to [re]discover. Also, in these terms, what's
> unlikely today may not be in the future.
> 

I wasn't disputing the validity of the bug. I was just thinking aloud
about how likely it is to hit.


regards

-- 
Tomas Vondra




Re: Fix overflow of nbatch

From: Tomas Vondra
Date:
Hi,

I kept looking at this, and unfortunately it seems a bit worse :-(

Fixing the overflow is easy enough - adding the cast does the trick.

But then there's the second issue I mentioned - the loop does not adjust
the hash_table_bytes. It updates the *space_allowed, but that's not what
the current_space/new_space formulas use.

This breaks the "balancing" as the nbatch gets decreased until

    nbatch <= (work_mem / BLCKSZ)

while the hash table "space_allowed" is increasing. This may result in
an *increased* memory usage :-(
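
For illustration, a simplified standalone simulation of that behaviour (not
the actual nodeHash.c loop; the 4MB hash table, 256k initial batches, and
BLCKSZ = 8192 are made-up values). Because hash_table_bytes never changes,
the loop keeps halving nbatch while space_allowed doubles on every iteration:

```
#include <stdio.h>
#include <stddef.h>

int
main(void)
{
    size_t  BLCKSZ = 8192;
    size_t  hash_table_bytes = 4UL * 1024 * 1024;   /* stays fixed: the bug */
    size_t  space_allowed = hash_table_bytes;
    size_t  nbatch = 256 * 1024;

    while (nbatch > 1)
    {
        size_t  current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
        size_t  new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);

        /* stop once halving nbatch would no longer reduce total space */
        if (current_space < new_space)
            break;

        nbatch /= 2;
        space_allowed *= 2;     /* grows, but the formulas above never see it */
    }

    /* prints: nbatch = 256, space_allowed = 4096 MB */
    printf("nbatch = %zu, space_allowed = %zu MB\n",
           nbatch, space_allowed / (1024 * 1024));
    return 0;
}
```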

I also noticed the code does not clamp nbuckets properly as it should.


The question is what to do about this. If we'd gotten this a week ago, I'd
probably just revert a1b4f289, and then try again for PG19. After all,
the issue it meant to address is somewhat rare.

But with 18 already stamped ...

I've shared these findings with the rest of the RMT, I'll see what their
thoughts are. Of course, other opinions/suggestions are welcome.


regards

-- 
Tomas Vondra




Re: Fix overflow of nbatch

From: Tom Lane
Date:
Tomas Vondra <tomas@vondra.me> writes:
> The question what to do about this. If we got this a week ago, I'd just
> probably just revert a1b4f289, and then try again for PG19. After all,
> the issue it meant to address is somewhat rare.
> But with 18 already stamped ...

18.0 is what it is.  If this is the most serious bug in it, I'll be
a bit surprised.  Take your time, fix it properly.

            regards, tom lane



Re: Fix overflow of nbatch

From: Nathan Bossart
Date:
On Tue, Sep 23, 2025 at 11:34:56AM -0400, Tom Lane wrote:
> 18.0 is what it is.  If this is the most serious bug in it, I'll be
> a bit surprised.  Take your time, fix it properly.

+1

-- 
nathan



Re: Fix overflow of nbatch

From: Konstantin Knizhnik
Date:
On 23/09/2025 6:11 PM, Tomas Vondra wrote:

> Hi,
>
> I kept looking at this, and unfortunately the it seems a bit worse :-(
>
> Fixing the overflow is easy enough - adding the cast does the trick.
>
> But then there's the second issue I mentioned - the loop does not adjust
> the hash_table_bytes. It updates the *space_allowed, but that's not what
> the current_space/new_space formulas use.
>
> This breaks the "balancing" as the nbatch gets decreased until
>
>      nbatch <= (work_mem / BLCKSZ)
>
> while the hash table "space_allowed" is increasing. This may result in
> an *increased* memory usage :-(
>
> I also noticed the code does not clamp nbuckets properly as it should.
>
>
> The question what to do about this. If we got this a week ago, I'd just
> probably just revert a1b4f289, and then try again for PG19. After all,
> the issue it meant to address is somewhat rare.
>
> But with 18 already stamped ...
>
> I've shared these findings with the rest of the RMT, I'll see what their
> thoughts are. Of course, other opinions/suggestions are welcome.
>
>
> regards
>


Hi Tomas,

If you are going to investigate this problem more, can you also look at 
the related problem:


https://www.postgresql.org/message-id/flat/52b94d5b-a135-489d-9833-2991a69ec623%40garret.ru#ebe4151f1d505bbcc32cf93b2e8a1936

I proposed the patch but got no feedback.


Best regards,
Konstantin





Re: Fix overflow of nbatch

From: Tomas Vondra
Date:

On 9/23/25 21:13, Konstantin Knizhnik wrote:
> On 23/09/2025 6:11 PM, Tomas Vondra wrote:
> 
>> Hi,
>>
>> I kept looking at this, and unfortunately the it seems a bit worse :-(
>>
>> Fixing the overflow is easy enough - adding the cast does the trick.
>>
>> But then there's the second issue I mentioned - the loop does not adjust
>> the hash_table_bytes. It updates the *space_allowed, but that's not what
>> the current_space/new_space formulas use.
>>
>> This breaks the "balancing" as the nbatch gets decreased until
>>
>>      nbatch <= (work_mem / BLCKSZ)
>>
>> while the hash table "space_allowed" is increasing. This may result in
>> an *increased* memory usage :-(
>>
>> I also noticed the code does not clamp nbuckets properly as it should.
>>
>>
>> The question what to do about this. If we got this a week ago, I'd just
>> probably just revert a1b4f289, and then try again for PG19. After all,
>> the issue it meant to address is somewhat rare.
>>
>> But with 18 already stamped ...
>>
>> I've shared these findings with the rest of the RMT, I'll see what their
>> thoughts are. Of course, other opinions/suggestions are welcome.
>>
>>
>> regards
>>
> 
> 
> Hi Tomas,
> 
> If you are going to investigate this problem more, can you also look at
> the related problem:
> 
> https://www.postgresql.org/message-id/flat/52b94d5b-
> a135-489d-9833-2991a69ec623%40garret.ru#ebe4151f1d505bbcc32cf93b2e8a1936
> 
> I proposed the patch but got no feedback.
> 

Thanks, I'll take a look. I wasn't aware of that thread (or rather I
didn't realize it might have been related to this). And AFAICS the
max_batches business is separate.

But the last bit seems very relevant:

-    while (nbatch > 0)
+    while (nbatch > 0 &&
+       nbuckets * 2 <= max_pointers) /* prevent allocation limit
overflow */

I believe this is the missing "nbucket clamping" that I mentioned above.
But I think it needs to work a bit differently. The max_pointers is
calculated from work_mem, but the whole point here is to grow the hash
table beyond that. I think it makes sense to relax that limit, and allow
up to MaxAllocSize, or something like that.



regards

-- 
Tomas Vondra




Re: Fix overflow of nbatch

From: Tomas Vondra
Date:
Hi,

Here's a couple draft patches fixing the bug:

- 0001 adds the missing size_t cast, to fix the overflow

- 0002 fixes the balancing, by adjusting the hash table size limit

- 0003 adds the missing overflow protection for nbuckets and the hash
table limit

- 0004 rewords the comment explaining how the balancing works. Reading
it after a couple months, I found it overly verbose / not very clear.
I'm sure it could be improved even further.


0001 and 0002 are pretty trivial, 0003 is a bit bigger, but most of the
code is simply how we clamp nbuckets elsewhere (more or less).

At some point I started wondering whether it'd have been better to use
the algebraic solution posted by James Hunter back in February [1]. It
wouldn't need the loop, but it'd still need all this new overflow
protection etc.


I wanted to make sure the patches actually make it work correctly, so I
created a table with 4B rows:

  create table t (a bigint, b text);
  insert into t select i, md5(i::text)
                  from generate_series(1,4000000000) s(i);

and I added this log message at the end of ExecChooseHashTableSize:

  elog(WARNING, "wm %d nbatch %d nbucket %d space %ld total %ld",
       work_mem, nbatch, nbuckets, (*space_allowed)/1024,
       (*space_allowed + 2 * nbatch * (Size) BLCKSZ)/1024);

and I ran an explain on a self-join

  set enable_mergejoin = off;
  set max_parallel_workers_per_gather = 0;
  set work_mem = '...';

  explain select * from t t1 join t t2 on (t1.a = t2.a);

with work_mem set to values between 64kB and 1GB.

On 18.0 I got this:

  wm 64 nbatch 8 nbucket 2097152 hash 131072 total 131200
  wm 128 nbatch 16 nbucket 4194304 hash 262144 total 262400
  wm 256 nbatch 32 nbucket 8388608 hash 524288 total 524800
  wm 512 nbatch 64 nbucket 16777216 hash 1048576 total 1049600
  wm 1024 nbatch 128 nbucket 33554432 hash 2097152 total 2099200
  wm 2048 nbatch 256 nbucket 33554432 hash 2097152 total 2101248
  wm 4096 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
  wm 8192 nbatch 1024 nbucket 8388608 hash 524288 total 540672
  wm 16384 nbatch 2048 nbucket 4194304 hash 262144 total 294912
  wm 32768 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
  wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
  wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
  wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248

I wanted to know how serious the issue is, compared to what would happen
without the balancing. I disabled the balancing (by skipping the loop),
and then I get this:

  wm 64 nbatch 8192 nbucket 2048 hash 128 total 131200
  wm 128 nbatch 16384 nbucket 4096 hash 256 total 262400
  wm 256 nbatch 32768 nbucket 8192 hash 512 total 524800
  wm 512 nbatch 65536 nbucket 16384 hash 1024 total 1049600
  wm 1024 nbatch 131072 nbucket 32768 hash 2048 total 2099200
  wm 2048 nbatch 131072 nbucket 65536 hash 4096 total 2101248
  wm 4096 nbatch 65536 nbucket 131072 hash 8192 total 1056768
  wm 8192 nbatch 32768 nbucket 262144 hash 16384 total 540672
  wm 16384 nbatch 16384 nbucket 524288 hash 32768 total 294912
  wm 32768 nbatch 8192 nbucket 1048576 hash 65536 total 196608
  wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
  wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
  wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
  wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248

The interesting bit is that the expected total memory usage (the last
number in the log line) is exactly the same as for 18.0 with and without
balancing. IIUC this is due to the "stop" condition using the initial
hash table size. It makes me a bit less worried about this triggering
OOM crashes - it does not improve the behavior, but it doesn't use more
memory than before. Still an embarrassing bug, though.

With the attached patches, this looks like this:

  wm 64 nbatch 256 nbucket 65536 hash 4096 total 8192
  wm 128 nbatch 512 nbucket 131072 hash 8192 total 16384
  wm 256 nbatch 1024 nbucket 262144 hash 16384 total 32768
  wm 512 nbatch 2048 nbucket 524288 hash 32768 total 65536
  wm 1024 nbatch 4096 nbucket 1048576 hash 65536 total 131072
  wm 2048 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 4096 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 8192 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 16384 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 32768 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
  wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
  wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
  wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248

So, this time it actually seems to work correctly and significantly
reduces the memory usage ...

There's one weird thing remaining - if you look at nbatch, it actually
increases for the first couple work_mem steps. That's weird, because
after increasing work_mem we should need *fewer* batches. But this has
nothing to do with the balancing, it happens even with it disabled.

The reason is that when calculating nbatch we do this:

    dbatch = Min(dbatch, max_pointers);

and max_pointers is calculated from work_mem (among other things). It's
a bit funny that the logic worries about how many batch pointers we have,
and refuses to allow more. But at the same time it ignores the BufFiles.

AFAICS it's harmless - we may pick a low number of batches initially, but
then later we'll ramp it up (and the balancing will work too). And if
you choose to run huge hash joins with tiny work_mem, I guess you're in
for the suffering anyway. In any case, it's unrelated to balancing.


regards

[1]
https://www.postgresql.org/message-id/CAJVSvF6290rJF2MtgSx_SuT9Kn2amZ_%2BzecoZYMU%2Bdn3BVVaZg%40mail.gmail.com

-- 
Tomas Vondra

Attachments

Re: Fix overflow of nbatch

From: Melanie Plageman
Date:
On Wed, Sep 24, 2025 at 7:02 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Here's a couple draft patches fixing the bug:
>
> - 0001 adds the missing size_t cast, to fix the overflow
> - 0002 fixes the balancing, by adjusting the hash table size limit
> - 0003 adds the missing overflow protection for nbuckets and the hash
> table limit

I've attached an alternative patch idea. It's shorter because I
avoided the multiplication using algebraic equivalence. There are two
overflow checks, one for nbuckets and one that max_pointers will be able
to be allocated, but otherwise it avoids most of the overflow risk by
switching to division.

On the topic of max_pointers, I think there is no need to handle >
MaxAllocSize. We were willing to cope with the original values of
nbatch and nbuckets, we are just trying to optimize that. If doing so
would make us run out of memory for the arrays of pointers to buckets,
just use the last hashtable size that didn't have this problem. That
means we don't have to do any shenanigans to have powers of two for
nbuckets because we don't do clamping.

Also, I think the outer loop needs the condition nbatch > 1 not nbatch
> 0 -- when nbatch is 1, once we divide it by 2, it would end up as 0.

I'm still waiting for the 4-billion-row table to be created on my
machine, so I haven't verified that I get the same results as you yet.

> - 0004 rewords the comment explaining how the balancing works. Reading
> it after a couple months, I found it overly verbose / not very clear.
> I'm sure it could be improved even further.

My attached patch does build on your revised wording.

- Melanie

Attachments

Re: Fix overflow of nbatch

From: Tomas Vondra
Date:
On 10/7/25 23:10, Melanie Plageman wrote:
> On Wed, Sep 24, 2025 at 7:02 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> Here's a couple draft patches fixing the bug:
>>
>> - 0001 adds the missing size_t cast, to fix the overflow
>> - 0002 fixes the balancing, by adjusting the hash table size limit
>> - 0003 adds the missing overflow protection for nbuckets and the hash
>> table limit
> 
> I've attached an alternative patch idea. It's shorter because I
> avoided multiplication using algebraic equivalence. There are two
> overflow checks for nbuckets and that max_pointers will be able to be
> allocated, but otherwise, it avoids most of the overflow risk by
> switching to division.
> 

Thanks for looking!

> On the topic of max_pointers, I think there is no need to handle >
> MaxAllocSize. We were willing to cope with the original values of
> nbatch and nbuckets, we are just trying to optimize that. If doing so
> would make us run out of memory for the arrays of pointers to buckets,
> just use the last hashtable size that didn't have this problem. That
> means we don't have to do any shenanigans to have powers of two for
> nbuckets because we don't do clamping.
> 

Hmm, so if I understand correctly you suggest stopping when nbuckets gets
too high, while my code allowed reducing nbatches further (and just
capped nbatches). I'm fine with this change if it makes the code
simpler; it means we allow ~130M buckets, which seems rather unlikely
to be a problem in practice. That means 1GB for buckets alone, and
tuples tend to be a multiple of that. With 4GB total, that's ~256k
batches. And multiplied by the 130M that would be 3.5e13 tuples ...

However, I don't understand why the condition bothers about INT_MAX?

    /* Ensure that nbuckets * 2 doesn't overflow an int */
    if (nbuckets > INT_MAX / 2)
        break;

AFAIK what matters is whether the buckets fit into MaxAllocSize. But
INT_MAX (or even INT_MAX/2) would allocate way more than that. The
original code (copied from an earlier part of the function) points out
the INT_MAX check is redundant, given the current MaxAllocSize value.

It's however true that if we double nbuckets and nbatch at the same
time, we don't need to bother doing this:

    max_pointers = hash_table_bytes / sizeof(HashJoinTuple);

because we can never break. Although, is that really true? When picking
the initial nbuckets value we do this:

    nbuckets = Max(nbuckets, 1024);

What if this increased the nbuckets (it'd require extremely low work_mem
of course)? Could it lead to unexpectedly high nbuckets in the loop?


Similarly, why does this check care about number of buckets?

    if (hash_table_bytes > MaxAllocSize / sizeof(HashJoinTuple) / 2)
        break;

I mean, (MaxAllocSize / sizeof(HashJoinTuple)) is the number of buckets
(hj tuple pointers) we can fit into 1GB. But hash_table_bytes is the
size of the hash table in bytes, not in number of items. Also, see how
get_hash_memory_limit caps the limit by SIZE_MAX. So I think we should
continue doing that.


I like the idea of simplifying the stop condition, instead of
calculating the old/new size, and comparing those. But is this actually
correct?

    if (nbatch / 4 < hash_table_bytes / BLCKSZ)
        break;

If we start with the "full" condition

 hash_bytes + (2 * nbatch * BLCKSZ) < hash_bytes * 2 + (nbatch * BLCKSZ)

subtract hash_bytes from both sides

  (2 * nbatch * BLCKSZ) < hash_bytes + (nbatch * BLCKSZ)

subtract (nbatch * BLCKSZ)

  nbatch * BLCKSZ < hash_bytes

that gives us

  nbatch < (hash_bytes / BLCKSZ)

So where did the /4 come from? Also, maybe it'd be easier to just do

  (nbatch * BLCKSZ) < hash_bytes

i.e. without the division.

> Also, I think the outer loop needs the condition nbatch > 1 not nbatch
>> 0 -- when nbatch is 1, once we divide it by 2, it would end up as 0.
> 

Good point. I was wondering why we're not seeing any failures because of
this (e.g. from tests using tiny amounts of data, with a single batch).
But we immediately hit the stop condition, because the two batch files use
much less than the minimum work_mem value (even without the multiplier).

So it's benign, but let's not do that anyway.

> I'm still waiting for the 4billion row table to be created on my
> machine, so I haven't verified that I get the same results as you yet.
> 
>> - 0004 rewords the comment explaining how the balancing works. Reading
>> it after a couple months, I found it overly verbose / not very clear.
>> I'm sure it could be improved even further.
> 
> My attached patch does build on your revised wording.
> 

Thanks. I'll re-read the comment tomorrow.


regards

-- 
Tomas Vondra




Re: Fix overflow of nbatch

From: Melanie Plageman
Date:
On Tue, Oct 7, 2025 at 7:51 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> On 10/7/25 23:10, Melanie Plageman wrote:
>
> Hmm, so if I understand correctly you suggest stop when nbuckets gets
> too high, while my code allowed reducing nbatches further (and just
> capped nbatches). I'm fine with this change, if it makes the code
> simpler, that means we allow ~130M buckets, which seems rather unlikely
> to be a problem in practice. That means 1GB for buckets alone, and
> tuples tend to be a multiple of that. With 4GB total, that's ~256k
> batches. And multiplied by the 130M that would be 3.5e13 tuples ...
>
> However, I don't understand why the condition bothers about INT_MAX?
>
>     /* Ensure that nbuckets * 2 doesn't overflow an int */
>     if (nbuckets > INT_MAX / 2)
>         break;

You're right. This can just be
        if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
            break;

> It's however true that if we double nbuckets and nbatch at the same
> time, we don't need to bother doing this:
>
>     max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
>
> because we can never break. Although, is that really true? When picking
> the initial nbuckets value we do this:
>
>     nbuckets = Max(nbuckets, 1024);
>
> What if this increased the nbuckets (it'd require extremely low work_mem
> of course)? Could it lead to unexpectedly high nbuckets in the loop?

If we are worried about nbuckets exceeding what can be allocated, then
I think the proposed condition above takes care of that

        if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
            break;

As for whether or not bumping nbuckets to 1024 before the loop means
we can have very high values with low work_mem: it seems like even
with the minimum work_mem, the number of buckets is larger than that.
When could we hit this? Maybe if the number of skew buckets is very
large?

> Similarly, why does this check care about number of buckets?
>
>     if (hash_table_bytes > MaxAllocSize / sizeof(HashJoinTuple) / 2)
>         break;
>
> I mean, (MaxAllocSize / sizeof(HashJoinTuple)) is the number of buckets
> (hj tuple pointers) we can fit into 1GB. But hash_table_bytes is the
> size of the hash table in bytes, not in number of items. Also, see how
> get_hash_memory_limit caps the limit by SIZE_MAX. So I think we should
> continue doing that.

Yes, this was wrong. I forgot an extra / sizeof(HashJoinTuple). I meant:

hash_table_bytes / sizeof(HashJoinTuple) > MaxAllocSize /
sizeof(HashJoinTuple) / 2

but I don't think we need this because nbuckets should always be
bigger than hash_table_bytes / sizeof(HashJoinTuple) since to start
out with, it was clamped to Max(1024, hash_table_bytes /
sizeof(HashJoinTuple)).

The only exception would be if  MaxAllocSize / sizeof(HashJoinTuple)
was smaller than hash_table_bytes / sizeof(HashJoinTuple) in which
case, checking nbuckets > MaxAllocSize / sizeof(HashJoinTuple) should
cover that anyway.

> I like the idea of simplifying the stop condition, instead of
> calculating the old/new size, and comparing those. But is this actually
> correct?
>
>     if (nbatch / 4 < hash_table_bytes / BLCKSZ)
>         break;
>
> If we start with the "full" condition
>
>  hash_bytes + (2 * nbatch * BLCKSZ) < hash_bytes * 2 + (nbatch * BLCKSZ)
>
> subtract hash_bytes from both sides
>
>   (2 * nbatch * BLCKSZ) < hash_bytes + (nbatch * BLCKSZ)
>
> subtract (nbatch * BLCKSZ)
>
>   nbatch * BLCKSZ < hash_bytes
>
> that gives us
>
>   nbatch < (hash_bytes / BLCKSZ)
>
> So where did the /4 come from?

Yes, I made the mistake of doubling the batches and hash_table_bytes
again, forgetting that the original formula was already comparing the
hypothetical space to the current space.

I think it should be
        if (nbatch < hash_table_bytes / BLCKSZ)
as you say

> Also, maybe it'd be easier to just do
>   (nbatch * BLCKSZ) < hash_bytes
>
> i.e. without the division.

I prefer the division to avoid as many potentially overflow-causing
operations as possible. Otherwise we would have to check that nbatch *
BLCKSZ doesn't overflow first.
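
A standalone sketch of that point (made-up values, not the patch itself):
the multiplied form needs a widening cast to be safe with a 32-bit int and
BLCKSZ = 8192, while the divided form never computes a large intermediate.
Note the two forms can differ by rounding when hash_table_bytes is not a
multiple of BLCKSZ:

```
#include <stdio.h>
#include <stddef.h>

int
main(void)
{
    int     nbatch = 262144;
    size_t  hash_table_bytes = 4UL * 1024 * 1024;

    /* nbatch * 8192 would overflow int, so the cast is mandatory here */
    int     keep_going_mul = (size_t) nbatch * 8192 >= hash_table_bytes;

    /* the division form stays within the range of its operands */
    int     keep_going_div = nbatch >= (int) (hash_table_bytes / 8192);

    printf("%d %d\n", keep_going_mul, keep_going_div);  /* 1 1 */
    return 0;
}
```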

> > I'm still waiting for the 4billion row table to be created on my
> > machine, so I haven't verified that I get the same results as you yet.

I have updated my patch to fix the mistakes above. I also noticed then
that I wasn't doubling space_allowed in the loop but instead setting
it to hash_table_bytes at the end. This doesn't produce a power of 2
because we subtract skew_mcvs from the hash_table_bytes. So, we have
to keep using space_allowed if we want a power of 2 in the end.

I've changed my patch to do this, but this made me wonder if we want
to be doing this or instead take hash_table_bytes at the end and round
it up to a power of 2 and set space_allowed to that. If the skew
hashtable is large, we may be allocating way more space_allowed than
we need for new hash_table_bytes + skew hashtable buckets.

Oh, and I get the same logging output results as your patch with attached v2.

- Melanie

Attachments

Re: Fix overflow of nbatch

From: Melanie Plageman
Date:
On Wed, Oct 8, 2025 at 1:37 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> I have updated my patch to fix the mistakes above. I also noticed then
> that I wasn't doubling space_allowed in the loop but instead setting
> it to hash_table_bytes at the end. This doesn't produce a power of 2
> because we subtract skew_mcvs from the hash_table_bytes. So, we have
> to keep using space_allowed if we want a power of 2 in the end.
>
> I've changed my patch to do this, but this made me wonder if we want
> to be doing this or instead take hash_table_bytes at the end and round
> it up to a power of 2 and set space_allowed to that. If the skew
> hashtable is large, we may be allocating way more space_allowed than
> we need for new hash_table_bytes + skew hashtable buckets.

Oh wait, that doesn't make sense because each batch could have a skew hashtable.

- Melanie



Re: Fix overflow of nbatch

From: Tomas Vondra
Date:

On 10/8/25 19:37, Melanie Plageman wrote:
> On Tue, Oct 7, 2025 at 7:51 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> On 10/7/25 23:10, Melanie Plageman wrote:
>>
>> Hmm, so if I understand correctly you suggest stop when nbuckets gets
>> too high, while my code allowed reducing nbatches further (and just
>> capped nbatches). I'm fine with this change, if it makes the code
>> simpler, that means we allow ~130M buckets, which seems rather unlikely
>> to be a problem in practice. That means 1GB for buckets alone, and
>> tuples tend to be a multiple of that. With 4GB total, that's ~256k
>> batches. And multiplied by the 130M that would be 3.5e13 tuples ...
>>
>> However, I don't understand why the condition bothers about INT_MAX?
>>
>>     /* Ensure that nbuckets * 2 doesn't overflow an int */
>>     if (nbuckets > INT_MAX / 2)
>>         break;
> 
> You're right. This can just be
>         if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
>             break;
> 
>> It's however true that if we double nbuckets and nbatch at the same
>> time, we don't need to bother doing this:
>>
>>     max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
>>
>> because we can never break. Although, is that really true? When picking
>> the initial nbuckets value we do this:
>>
>>     nbuckets = Max(nbuckets, 1024);
>>
>> What if this increased the nbuckets (it'd require extremely low work_mem
>> of course)? Could it lead to unexpectedly high nbuckets in the loop?
> 
> If we are worried about nbuckets exceeding what can be allocated, then
> I think the proposed condition above takes care of that
> 
>         if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
>             break;
> 
> As for whether or not bumping nbuckets to 1024 before the loop means
> we can have very high values with low work_mem: it seems like even
> with the minimum work_mem, the number of buckets is larger than that.
> When could we hit this? Maybe if the number of skew buckets is very
> large?
> 
>> Similarly, why does this check care about number of buckets?
>>
>>     if (hash_table_bytes > MaxAllocSize / sizeof(HashJoinTuple) / 2)
>>         break;
>>
>> I mean, (MaxAllocSize / sizeof(HashJoinTuple)) is the number of buckets
>> (hj tuple pointers) we can fit into 1GB. But hash_table_bytes is the
>> size of the hash table in bytes, not in number of items. Also, see how
>> get_hash_memory_limit caps the limit by SIZE_MAX. So I think we should
>> continue doing that.
> 
> Yes, this was wrong. I forgot an extra / sizeof(HashJoinTuple). I meant:
> 
> hash_table_bytes / sizeof(HashJoinTuple) > MaxAllocSize /
> sizeof(HashJoinTuple) / 2
> 

But the hash table is not allocated as a single chunk of memory, so I
think MaxAllocSize would be the wrong thing to use here, no? Hash table
is a collection of tuples (allocated one by one), that's why
get_hash_memory_limit() uses SIZE_MAX. MaxAllocSize matters for
nbuckets, because that indeed is allocated as a contiguous array, ofc.

Also, why bother with /sizeof(HashJoinTuple) on both sides? Without it
we get

    hash_table_bytes > MaxAllocSize / 2

but again, that doesn't make much sense - the hash table can be larger,
it's not a single palloc.

> but I don't think we need this because nbuckets should always be
> bigger than hash_table_bytes / sizeof(HashJoinTuple) since to start
> out with, it was clamped to Max(1024, hash_table_bytes /
> sizeof(HashJoinTuple)).
> > The only exception would be if  MaxAllocSize / sizeof(HashJoinTuple)
> was smaller than hash_table_bytes / sizeof(HashJoinTuple) in which
> case, checking nbuckets > MaxAllocSize / sizeof(HashJoinTuple) should
> cover that anyway.
> 

I'm confused about what this check was meant to do. The check said this

    /*
     * Ensure that hash_table_bytes * 2 doesn't exceed MaxAllocSize /
     * sizeof(HashJoinTuple)
     */
    if (hash_table_bytes > MaxAllocSize / sizeof(HashJoinTuple) / 2)
        break;

And I think it should be just

    if (hash_table_bytes > SIZE_MAX / 2)
        break;

as a protection against hash_table_bytes overflowing SIZE_MAX (which on
64-bits seems implausible, but could happen on 32-bit builds).
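
For illustration, the guard idiom in isolation (double_budget is a made-up
helper, not anything in the tree): checking against SIZE_MAX / 2 before
doubling guarantees the multiplication itself cannot wrap:

```
#include <stdint.h>
#include <stddef.h>

/* double a size_t budget, refusing to wrap around */
static int
double_budget(size_t *budget)
{
    if (*budget > SIZE_MAX / 2)
        return 0;               /* doubling would overflow, leave unchanged */
    *budget *= 2;
    return 1;
}
```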

Also, how is this related to nbuckets?


>> I like the idea of simplifying the stop condition, instead of
>> calculating the old/new size, and comparing those. But is this actually
>> correct?
>>
>>     if (nbatch / 4 < hash_table_bytes / BLCKSZ)
>>         break;
>>
>> If we start with the "full" condition
>>
>>  hash_bytes + (2 * nbatch * BLCKSZ) < hash_bytes * 2 + (nbatch * BLCKSZ)
>>
>> subtract hash_bytes from both sides
>>
>>   (2 * nbatch * BLCKSZ) < hash_bytes + (nbatch * BLCKSZ)
>>
>> subtract (nbatch * BLCKSZ)
>>
>>   nbatch * BLCKSZ < hash_bytes
>>
>> that gives us
>>
>>   nbatch < (hash_bytes / BLCKSZ)
>>
>> So where did the /4 come from?
> 
> Yes, I made the mistake of doubling the batches and hash_table_bytes
> again, forgetting that the original formula was already comparing the
> hypothetical space to the current space.
> 
> I think it should be
>         if (nbatch < hash_table_bytes / BLCKSZ)
> as you say
> 
>> Also, maybe it'd be easier to just do
>>   (nbatch * BLCKSZ) < hash_bytes
>>
>> i.e. without the division.
> 
> I prefer the division to avoid as many potentially overflow causing
> operations as possible. otherwise we would have to check that nbatch *
> BLCKSZ doesn't overflow first.
> 

Ah, I forgot about overflow again! Which is ironic, because this whole
thing is about overflows.

I really wish we had an infinite-precision-int ;-)

>>> I'm still waiting for the 4billion row table to be created on my
>>> machine, so I haven't verified that I get the same results as you yet.
> 
> I have updated my patch to fix the mistakes above. I also noticed then
> that I wasn't doubling space_allowed in the loop but instead setting
> it to hash_table_bytes at the end. This doesn't produce a power of 2
> because we subtract skew_mcvs from the hash_table_bytes. So, we have
> to keep using space_allowed if we want a power of 2 in the end.
> 
> I've changed my patch to do this, but this made me wonder if we want
> to be doing this or instead take hash_table_bytes at the end and round
> it up to a power of 2 and set space_allowed to that. If the skew
> hashtable is large, we may be allocating way more space_allowed than
> we need for new hash_table_bytes + skew hashtable buckets.
> 
> Oh, and I get the same logging output results as your patch with attached v2.
> 

Cool, in the worst case we're both wrong in the same way ;-)


regards

-- 
Tomas Vondra




Re: Fix overflow of nbatch

From: Tomas Vondra
Date:

On 10/8/25 21:16, Melanie Plageman wrote:
> On Wed, Oct 8, 2025 at 1:37 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>>
>> I have updated my patch to fix the mistakes above. I also noticed then
>> that I wasn't doubling space_allowed in the loop but instead setting
>> it to hash_table_bytes at the end. This doesn't produce a power of 2
>> because we subtract skew_mcvs from the hash_table_bytes. So, we have
>> to keep using space_allowed if we want a power of 2 in the end.
>>
>> I've changed my patch to do this, but this made me wonder if we want
>> to be doing this or instead take hash_table_bytes at the end and round
>> it up to a power of 2 and set space_allowed to that. If the skew
>> hashtable is large, we may be allocating way more space_allowed than
>> we need for new hash_table_bytes + skew hashtable buckets.
> 

I don't think there's any promise of hash_table_bytes being a power of 2.
You can make hash_table_bytes an almost arbitrary value by setting
work_mem and hash_mem_multiplier. Or am I missing something?

But you're right hash_table_bytes and space_allowed may not be equal if
useskew=true. So setting space_allowed to hash_table_bytes at the end
does not seem right. I think we don't actually need hash_table_bytes at
this point, we can just ignore it, and use/double *space_allowed.

I kept using hash_table_bytes mostly because it didn't require the
pointer dereferencing, but I failed to consider the useskew=true thing.

However, this means there's probably a bug - the loop should probably
double num_skew_mcvs too. We simply reserve SKEW_HASH_MEM_PERCENT of
space_allowed for skew hashtable, so should we adjust it the same way?

> Oh wait, that doesn't make sense because each batch could have a skew hashtable.
> 

Not sure I understand. Is this the same issue I just described?

regards

-- 
Tomas Vondra




Re: Fix overflow of nbatch

From: Melanie Plageman
Date:
On Wed, Oct 8, 2025 at 4:46 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> On 10/8/25 19:37, Melanie Plageman wrote:
>
> > Yes, this was wrong. I forgot an extra / sizeof(HashJoinTuple). I meant:
> >
> > hash_table_bytes / sizeof(HashJoinTuple) > MaxAllocSize /
> > sizeof(HashJoinTuple) / 2
>
> But the hash table is not allocated as a single chunk of memory, so I
> think MaxAllocSize would be the wrong thing to use here, no? Hash table
> is a collection of tuples (allocated one by one), that's why
> get_hash_memory_limit() uses SIZE_MAX. MaxAllocSize matters for
> nbuckets, because that indeed is allocated as a contiguous array, ofc.
>
> Also, why bother with /sizeof(HashJoinTuple) on both sides? Without it
> we get
>
>     hash_table_bytes > MaxAllocSize / 2
>
> but again, that doesn't make much sense - the hash table can be larger,
> it's not a single palloc.

It came from the earlier clamping of nbuckets:

    max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
    max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
    dbuckets = Min(dbuckets, max_pointers);

I don't really get why this divides hash_table_bytes by
sizeof(HashJoinTuple) since, as you say, hash_table_bytes is supposed
to accommodate both the bucket_bytes and inner_rel_bytes.

> And I think it should be just
>
>     if (hash_table_bytes > SIZE_MAX / 2)
>         break;
>
> as a protection against hash_table_bytes overflowing SIZE_MAX (which on
> 64-bits seems implausible, but could happen on 32-bit builds).

That's roughly the check I ended up with -- except I used
space_allowed because it will be larger than hash_table_bytes if there
is a skew hashtable. I've updated the comment and such in attached v3.

- Melanie

Attachments

Re: Fix overflow of nbatch

From: Tomas Vondra
Date:
On 10/9/25 16:16, Melanie Plageman wrote:
> On Wed, Oct 8, 2025 at 4:46 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> On 10/8/25 19:37, Melanie Plageman wrote:
>>
>>> Yes, this was wrong. I forgot an extra / sizeof(HashJoinTuple). I meant:
>>>
>>> hash_table_bytes / sizeof(HashJoinTuple) > MaxAllocSize /
>>> sizeof(HashJoinTuple) / 2
>>
>> But the hash table is not allocated as a single chunk of memory, so I
>> think MaxAllocSize would be the wrong thing to use here, no? Hash table
>> is a collection of tuples (allocated one by one), that's why
>> get_hash_memory_limit() uses SIZE_MAX. MaxAllocSize matters for
>> nbuckets, because that indeed is allocated as a contiguous array, ofc.
>>
>> Also, why bother with /sizeof(HashJoinTuple) on both sides? Without it
>> we get
>>
>>     hash_table_bytes > MaxAllocSize / 2
>>
>> but again, that doesn't make much sense - the hash table can be larger,
>> it's not a single palloc.
> 
> It came from the earlier clamping of nbuckets:
> 
>     max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
>     max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
>     dbuckets = Min(dbuckets, max_pointers);
> 
> I don't really get why this divides hash_table_bytes by
> sizeof(HashJoinTuple) since, as you say, hash_table_bytes is supposed
> to accommodate both the bucket_bytes and inner_rel_bytes.
> 

I think that's simply to allocate enough buckets for the expected
number of tuples, not more. And cap it to MaxAllocSize. Some of this may
be redundant, I suppose.

>> And I think it should be just
>>
>>     if (hash_table_bytes > SIZE_MAX / 2)
>>         break;
>>
>> as a protection against hash_table_bytes overflowing SIZE_MAX (which on
>> 64-bits seems implausible, but could happen on 32-bit builds).
> 
> That's roughly the check I ended up with -- except I used
> space_allowed because it will be larger than hash_table_bytes if there
> is a skew hashtable. I've updated the comment and such in attached v3.
> 

Ah, thanks. I was just hacking on this too, but I'll switch to your v3.


regards

-- 
Tomas Vondra




Re: Fix overflow of nbatch

From: Melanie Plageman
Date:
On Wed, Oct 8, 2025 at 5:08 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> On 10/8/25 21:16, Melanie Plageman wrote:
> > On Wed, Oct 8, 2025 at 1:37 PM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> >>
> >> I have updated my patch to fix the mistakes above. I also noticed then
> >> that I wasn't doubling space_allowed in the loop but instead setting
> >> it to hash_table_bytes at the end. This doesn't produce a power of 2
> >> because we subtract skew_mcvs from the hash_table_bytes. So, we have
> >> to keep using space_allowed if we want a power of 2 in the end.
> >>
> >> I've changed my patch to do this, but this made me wonder if we want
> >> to be doing this or instead take hash_table_bytes at the end and round
> >> it up to a power of 2 and set space_allowed to that. If the skew
> >> hashtable is large, we may be allocating way more space_allowed than
> >> we need for new hash_table_bytes + skew hashtable buckets.
> >
>
> I don't think there's any promise hash_table_bytes being a power of 2.
> You can make hash_table_bytes an almost arbitrary value by setting
> work_mem and hash_mem_multiplier. Or am I missing something?

Right. Your example used powers of 2 for work_mem, which is why I saw
power of 2 output for space_allowed, but that is arbitrary.

> But you're right hash_table_bytes and space_allowed may not be equal if
> useskew=true. So setting space_allowed to hash_table_bytes at the end
> does not seem right. I think we don't actually need hash_table_bytes at
> this point, we can just ignore it, and use/double *space_allowed.
>
> I kept using hash_table_bytes mostly because it didn't require the
> pointer dereferencing, but I failed to consider the useskew=true thing.

We could make some local variable of space_allowed if we want.

> However, this means there's probably a bug - the loop should probably
> double num_skew_mcvs too. We simply reserve SKEW_HASH_MEM_PERCENT of
> space_allowed for skew hashtable, so should we adjust it the same way?

We do need to recalculate num_skew_mcvs once at the end before returning.

But I don't think we need to do it on each iteration of the loop. We
want the total size of what is in memory -- the regular and skew
hashtable -- to be the condition we break on. I think that's
space_allowed.

> > Oh wait, that doesn't make sense because each batch could have a skew hashtable.
>
> Not sure I understand. Is this the same issue I just described?

Yea, what I meant is that we are increasing the size of the hashtable
and decreasing the number of batches on the assumption that we might
end up with tuples from multiple batches consolidated down to one
batch and one larger hashtable. Each of those batches could have skew
buckets so we now end up needing a larger skew hashtable too, not just
a larger regular hashtable.

- Melanie



Re: Fix overflow of nbatch

From: Tomas Vondra
Date:
On 10/9/25 16:30, Tomas Vondra wrote:
> On 10/9/25 16:16, Melanie Plageman wrote:
>> On Wed, Oct 8, 2025 at 4:46 PM Tomas Vondra <tomas@vondra.me> wrote:
>>>
>>> On 10/8/25 19:37, Melanie Plageman wrote:
>>>
>>>> Yes, this was wrong. I forgot an extra / sizeof(HashJoinTuple). I meant:
>>>>
>>>> hash_table_bytes / sizeof(HashJoinTuple) > MaxAllocSize /
>>>> sizeof(HashJoinTuple) / 2
>>>
>>> But the hash table is not allocated as a single chunk of memory, so I
>>> think MaxAllocSize would be the wrong thing to use here, no? Hash table
>>> is a collection of tuples (allocated one by one), that's why
>>> get_hash_memory_limit() uses SIZE_MAX. MaxAllocSize matters for
>>> nbuckets, because that indeed is allocated as a contiguous array, ofc.
>>>
>>> Also, why bother with /sizeof(HashJoinTuple) on both sides? Without it
>>> we get
>>>
>>>     hash_table_bytes > MaxAllocSize / 2
>>>
>>> but again, that doesn't make much sense - the hash table can be larger,
>>> it's not a single palloc.
>>
>> It came from the earlier clamping of nbuckets:
>>
>>     max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
>>     max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
>>     dbuckets = Min(dbuckets, max_pointers);
>>
>> I don't really get why this divides hash_table_bytes by
>> sizeof(HashJoinTuple) since, as you say, hash_table_bytes is supposed
>> to accommodate both the bucket_bytes and inner_rel_bytes.
>>
> 
> I think that's simply to allocate enough buckets for the the expected
> number of tuples, not more. And cap it to MaxAllocSize. Some of this may
> be redundant, I suppose.
> 
>>> And I think it should be just
>>>
>>>     if (hash_table_bytes > SIZE_MAX / 2)
>>>         break;
>>>
>>> as a protection against hash_table_bytes overflowing SIZE_MAX (which on
>>> 64-bits seems implausible, but could happen on 32-bit builds).
>>
>> That's roughly the check I ended up with -- except I used
>> space_allowed because it will be larger than hash_table_bytes if there
>> is a skew hashtable. I've updated the comment and such in attached v3.
>>
> 
> Ah, thanks. I was just hacking on this too, but I'll switch to your v3.
> 

Here's a v4, which is your v3 with a couple minor adjustments:

1) A couple comments adjusted. It feels a bit too audacious to correct
comments written by a native speaker, but it seems cleaner to me like this.

2) I think the (nbatch < hash_table_bytes / BLCKSZ) condition really
needs to check space_allowed, because that's the whole space, before
subtracting the skew buckets. And we also check the overflow for
space_allowed earlier, not hash_table_bytes.

3) removes the hash_table_bytes doubling, because it's not needed by the
loop anymore (or after that)

4) added the doubling of num_skew_mcvs, and also the overflow protection
for that

You suggested this in another message:

> We do need to recalculate num_skew_mcvs once at the end before
> returning.

But I think the doubling has the same effect, right? I don't want to
redo the whole "if (useskew) { ... }" block at the end.

I wonder if maybe it'd be better to just merge all the overflow checks
into a single "if (a || b || c)" check. These separate checks seem quite
verbose. I'll see tomorrow.
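
For illustration only, a merged guard could look roughly like this (a
sketch based on the checks discussed upthread, not the committed code;
the exact conditions may differ):

    /* hypothetical combined guard against the doubled values overflowing */
    if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2 ||
        *space_allowed > SIZE_MAX / 2 ||
        *num_skew_mcvs > INT_MAX / 2)
        break;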

I've done a bit more testing today. I went a bit overboard and created a
table with 20B rows, which actually pushes nbatch high enough to hit the
initial issue (overflow in the size calculation). And it works sanely
with the v4 patch (and v3 too).

I guess I could have tweaked the table stats, or just manipulated the
values at the beginning of the function. But that wouldn't have been as
much fun.


regards

-- 
Tomas Vondra

Attachments

Re: Fix overflow of nbatch

From
Melanie Plageman
Date:
On Thu, Oct 9, 2025 at 7:36 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> 1) A couple comments adjusted. It feels a bit too audacious to correct
> comments written by a native speaker, but it seems cleaner to me like this.

I attached a patch with a few more suggested adjustments (0003). The
more substantive tweaks are:

I don't really like that this comment says it is about nbuckets
overflowing MaxAllocSize. "Overflow" means something specific, and this
sounds like we are saying the nbuckets variable itself will overflow
MaxAllocSize, when what we mean is that nbuckets worth of HashJoinTuples
could overflow MaxAllocSize. You don't have to use my wording, but I'm
not sure about this wording either.

/* Check that nbuckets won't overflow MaxAllocSize */
if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
        break;

Also, upon re-reading the comment we wrote together:

        * With extremely low work_mem values, nbuckets may have been set
        * higher than hash_table_bytes / sizeof(HashJoinTuple). We don't try
        * to correct that here, we accept nbuckets to be oversized.

I'm not so sure it belongs above the nbuckets allocation check.
Perhaps we should move it to where we are doubling nbuckets. Or even
above where we clamp it to 1024.

I'm actually wondering if we want this comment at all. Does it
emphasize an edge case to the point of being confusing? I'm imagining myself
studying it in the future and having no idea what it means.

I've kept it in but moved it in 0003.

> 4) added the doubling of num_skew_mcvs, and also the overflow protection
> for that

Do we really need to check if num_skew_mcvs will overflow? Shouldn't
it always be smaller than nbuckets? Maybe it can be an assert.
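
For illustration, that could be as simple as the following (hypothetical
placement inside the doubling loop, using the local names from the
discussion):

    /* skew MCVs are a small fraction of the table, so this should always hold */
    Assert(*num_skew_mcvs < nbuckets);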

> You suggested this in another message:
>
> > We do need to recalculate num_skew_mcvs once at the end before
> > returning.
>
> But I think the doubling has the same effect, right? I don't want to
> redo the whole "if (useskew) { ... }" block at the end.

Yea, it would have to be some kind of helper or something. I worried
just doubling num_skew_mcvs would drift significantly because of
integer truncation -- perhaps even a meaningful amount. But it was
just an intuition -- I didn't plug in any numbers and try.

- Melanie

Attachments

Re: Fix overflow of nbatch

From
Melanie Plageman
Date:
On Mon, Oct 13, 2025 at 12:05 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Thu, Oct 9, 2025 at 7:36 PM Tomas Vondra <tomas@vondra.me> wrote:
> >
> > 1) A couple comments adjusted. It feels a bit too audacious to correct
> > comments written by a native speaker, but it seems cleaner to me like this.
>
> I attached a patch with a few more suggested adjustments (0003)

Oh, and I didn't add this, but all of the other pointer dereferences
have unnecessary (in terms of operator precedence) parentheses around
them, so for consistency I would put them around this one too (or remove
them everywhere, since they are not needed):

if (nbatch < *space_allowed / BLCKSZ)
        break;
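
With the parentheses added for consistency, that condition would read:

if (nbatch < (*space_allowed) / BLCKSZ)
        break;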

- Melanie



Re: Fix overflow of nbatch

From
Tomas Vondra
Date:
On 10/13/25 18:05, Melanie Plageman wrote:
> On Thu, Oct 9, 2025 at 7:36 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> 1) A couple comments adjusted. It feels a bit too audacious to correct
>> comments written by a native speaker, but it seems cleaner to me like this.
> 
> I attached a patch with a few more suggested adjustments (0003). The
> more substantive tweaks are:
> 
> I don't really like that this comment says it is about nbuckets
> overflowing MaxAllocSize. "Overflow" means something specific, and this
> sounds like we are saying the nbuckets variable itself will overflow
> MaxAllocSize, when what we mean is that nbuckets worth of HashJoinTuples
> could overflow MaxAllocSize. You don't have to use my wording, but I'm
> not sure about this wording either.
> 
> /* Check that nbuckets won't overflow MaxAllocSize */
> if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
>         break;
> 

I think the wording is fine; it just needs to talk about "buckets" and
not "nbuckets". I did that in the attached v6.

> Also, upon re-reading the comment we wrote together:
> 
>         * With extremely low work_mem values, nbuckets may have been set
>         * higher than hash_table_bytes / sizeof(HashJoinTuple). We don't try
>         * to correct that here, we accept nbuckets to be oversized.
> 
> I'm not so sure it belongs above the nbuckets allocation check.
> Perhaps we should move it to where we are doubling nbuckets. Or even
> above where we clamp it to 1024.
> 

Yeah, I'm not happy with the place either. But mentioning this above the
clamp to 1024 doesn't seem great either.

> I'm actually wondering if we want this comment at all. Does it
> emphasize an edge case to the point of being confusing? I'm imagining myself
> studying it in the future and having no idea what it means.
> 

You may be right. I thought someone might read the code in the future,
possibly while investigating a case when the loop stopped too early, and
be puzzled, not realizing nbuckets might be too high. But frankly, that's
super unlikely. It only applies to cases with extremely low work_mem
values, which are quite unlikely on machines doing massive joins.

I'll think about it in the morning, but I'll probably remove it.

> I've kept it in but moved it in 0003.
> 
>> 4) added the doubling of num_skew_mcvs, and also the overflow protection
>> for that
> 
> Do we really need to check if num_skew_mcvs will overflow? Shouldn't
> it always be smaller than nbuckets? Maybe it can be an assert.
> 

Good point. I wasn't sure if that's guaranteed, but after looking at the
skew_mcvs calculation again I think you're right. So +1 to assert.

>> You suggested this in another message:
>>
>>> We do need to recalculate num_skew_mcvs once at the end before
>>> returning.
>>
>> But I think the doubling has the same effect, right? I don't want to
>> redo the whole "if (useskew) { ... }" block at the end.
> 
> Yea, it would have to be some kind of helper or something. I worried
> just doubling num_skew_mcvs would drift significantly because of
> integer truncation -- perhaps even a meaningful amount. But it was
> just an intuition -- I didn't plug in any numbers and try.
> 

I think the integer truncation should not matter. AFAIK it could be off
by 1 on the first loop (due to rounding), and then the error gets
doubled on every loop. So with 8 loops we might be off by 127, right?
But with the 2% SKEW_HASH_MEM_PERCENT that difference is negligible
compared to the actual skew_mcvs value, I think.
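
As a standalone sanity check of that intuition (the numbers are made up,
not taken from the planner), the drift from doubling a truncated value
stays bounded by 2^loops - 1:

    #include <stdio.h>

    int
    main(void)
    {
        double  exact = 1000.9;          /* hypothetical "exact" MCV count */
        long    doubled = (long) exact;  /* truncated once, then doubled */

        for (int i = 0; i < 7; i++)
        {
            exact *= 2;
            doubled *= 2;
        }

        /* prints recomputed=128115 doubled=128000 drift=115 (< 127) */
        printf("recomputed=%ld doubled=%ld drift=%ld\n",
               (long) exact, doubled, (long) exact - doubled);
        return 0;
    }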

All these formulas are rough guesses based on arbitrary constants (like
SKEW_HASH_MEM_PERCENT) anyway.


I'll give this a bit more testing and review tomorrow, and then I'll
push. I don't want to hold this back through pgconf.eu.


regards

-- 
Tomas Vondra

Attachments

Re: Fix overflow of nbatch

From
Tomas Vondra
Date:
On 10/14/25 23:13, Tomas Vondra wrote:
> ...
> 
> I'll give this a bit more testing and review tomorrow, and then I'll
> push. I don't want to hold this back through pgconf.eu.
> 

Pushed and backpatched, after some minor tweaks. Thanks for the reviews
and feedback. I consider this fixed now.


One remaining tweak I've been experimenting with (for master) is fixing
the weird behavior I described at the end of [1]. The root cause is that
we cap nbatch by max_pointers, which is the max number of pointers we
can fit into work_mem. The consequence is that increasing work_mem also
increases nbatch, which is counter-intuitive. It's a bit strange, as it
caps the number of batch pointers while ignoring the batch buffers, which
are ~1000x larger.

I experimented with capping nbatch by how many pointers fit into
MaxAllocSize (and INT_MAX).

    Min(MaxAllocSize / sizeof(void *), INT_MAX / 2 + 1);
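
Spelled out as a rough sketch (hypothetical variable names, not the
actual experiment), the cap would be applied independently of work_mem:

    /* hypothetical: clamp the batch-count estimate without involving work_mem */
    max_batch_pointers = Min(MaxAllocSize / sizeof(void *), INT_MAX / 2 + 1);
    nbatch_estimate = Min(nbatch_estimate, max_batch_pointers);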

But I think this does not really matter much in practice. First, this
only happens with low work_mem values, while systems doing large joins
tend to have work_mem increased. Second, this means the nbatch is set
too low, and it'll get "fixed" by the memory balaning at runtime.

So I think we don't need to do anything about this.


[1]
https://www.postgresql.org/message-id/244dc6c1-3b3d-4de2-b3de-b1511e6a6d10%40vondra.me

-- 
Tomas Vondra