Thread: BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker


BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

From: PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18349
Logged by:          Alexey Ermakov
Email address:      alexey.ermakov@dataegret.com
PostgreSQL version: 16.2
Operating system:   Rocky Linux release 9.3 (Blue Onyx)
Description:

Hello! I found a problem with memory consumption in a parallel worker when
PostgreSQL 16.2 runs a complex analytical query.
After simplifying, I was able to reproduce it with the following query with
just 2 joins:

2024-02-16 12:14:25.669 UTC [425424] ERROR:  XX000: invalid DSA memory alloc
request size 1811939328
2024-02-16 12:14:25.669 UTC [425424] LOCATION:  dsa_allocate_extended,
dsa.c:690
2024-02-16 12:14:25.669 UTC [425424] STATEMENT:  SELECT *
        FROM history h
             LEFT JOIN coupon cpn ON cpn.recloc = h.recloc AND cpn.version =
h.version
             LEFT JOIN free_text i2 ON i2.recloc = cpn.recloc AND i2.part =
1
          WHERE h.date_operation >= '2024-02-13' and h.date_operation <
'2024-02-14';
2024-02-16 12:14:25.669 UTC [425424] DEBUG:  00000: shmem_exit(1): 5
before_shmem_exit callbacks to make
2024-02-16 12:14:25.669 UTC [425424] LOCATION:  shmem_exit, ipc.c:240

Here are some memory- and parallel-worker-related settings:
max_parallel_workers = 12
max_parallel_workers_per_gather = 2
hash_mem_multiplier = 2
work_mem = '4MB'

When I increase work_mem to 16MB, the query runs successfully (same plan):

 Gather  (cost=1082271.57..8357367.15 rows=26292632 width=345) (actual
time=50735.451..64000.782 rows=1325161 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=4102227 read=7004436, temp read=1808967
written=1809044
   I/O Timings: shared read=37739.143, temp read=4809.969 write=9139.669
   ->  Parallel Hash Right Join  (cost=1081271.57..5727103.95 rows=10955263
width=345) (actual time=50716.477..61271.276 rows=441720 loops=3)
         Hash Cond: ((i2.recloc)::text = (cpn.recloc)::text)
         Buffers: shared hit=4102227 read=7004436, temp read=1808967
written=1809044
         I/O Timings: shared read=37739.143, temp read=4809.969
write=9139.669
         ->  Parallel Seq Scan on free_text i2  (cost=0.00..4160436.50
rows=74411061 width=63) (actual time=0.196..37944.130 rows=60240671
loops=3)
               Filter: (part = 1)
               Rows Removed by Filter: 160726692
               Buffers: shared hit=158869 read=7004436
               I/O Timings: shared read=37739.143
         ->  Parallel Hash  (cost=1077675.12..1077675.12 rows=220524
width=282) (actual time=1406.500..1406.503 rows=209531 loops=3)
               Buckets: 131072  Batches: 8  Memory Usage: 30496kB
               Buffers: shared hit=3943254, temp written=10316
               I/O Timings: temp write=58.043
               ->  Nested Loop Left Join  (cost=1.14..1077675.12 rows=220524
width=282) (actual time=0.048..1239.847 rows=209531 loops=3)
                     Buffers: shared hit=3943254
                     ->  Parallel Index Scan using history_index_date on
history h  (cost=0.57..42000.62 rows=156296 width=90) (actual
time=0.021..76.927 rows=198024 loops=3)
                           Index Cond: ((date_operation >= '2024-02-13
00:00:00'::timestamp without time zone) AND (date_operation < '2024-02-14
00:00:00'::timestamp without time zone))
                           Buffers: shared hit=260503
                     ->  Index Scan using coupon_index_rl on coupon cpn
(cost=0.57..6.62 rows=1 width=192) (actual time=0.004..0.005 rows=1
loops=594071)
                           Index Cond: ((recloc)::text = (h.recloc)::text)
                           Filter: (version = h.version)
                           Rows Removed by Filter: 2
                           Buffers: shared hit=3682751
 Planning:
   Buffers: shared hit=72
 Planning Time: 0.506 ms
 Execution Time: 64060.476 ms
(32 rows)


Interesting thing: after the query fails, it takes some time (seconds) before
psql gives control back.
It turns out that PostgreSQL wrote 1M+ (!) temporary files (most of them
empty), and cleaning them up takes some time.
With bigger work_mem it doesn't write so many files. Maybe the memory issue
is somehow related to the number of files.

PostgreSQL 16.2 included a fix for a similar bug: [1], [2]. Could there be
some issue with it?

I haven't been able to make a reproducible test case so far, but maybe the
given information is enough...


[1]
https://github.com/postgres/postgres/commit/2a67b5a60ee68892bb028587ddc6de7650822480
[2]
https://www.postgresql.org/message-id/flat/16925-ec96d83529d0d629%40postgresql.org


Thanks,
Alexey Ermakov


On 16/2/2024 20:18, PG Bug reporting form wrote:

> Interesting thing: after query fails it take some time (seconds) before psql
> gives control back.
> It turns out that postgresql written 1M+ (!) temporary files (most of them
> are empty) and cleaning up takes some time.
Interesting. It correlates with a performance issue I have been trying 
to catch for three months already. Could you provide some reproduction of that 
behavior?
> With bigger work_mem it won't write so many files. Maybe memory issue
> somehow related with number of files.
> 
> In PostgreSQL 16.2 there were fix for similar bug: [1],[2]. Could it be that
> there is some issue with it ?
This surely looks like the same issue we fixed recently: Parallel Hash Join + 
DSM is a sign of that problem.

-- 
regards,
Andrei Lepikhov
Postgres Professional




On 2024-02-16 20:40, Andrei Lepikhov wrote:
> Interesting. It correlates with one performance issue I have been 
> trying to catch already 3 months. Could you provide some reproduction 
> of that behavior?
>
Yes, I'm still trying to make a reproducer; it will take some time. Thanks.

--

Alexey Ermakov




Hi,

On 2/20/24 15:05, Alexey Ermakov wrote:
> On 2024-02-16 20:40, Andrei Lepikhov wrote:
>> Interesting. It correlates with one performance issue I have been
>> trying to catch already 3 months. Could you provide some reproduction
>> of that behavior?
>>
> Yes, I'm still trying to make reproducer, it will take some time. Thanks.
> 

I wonder if this might be yet another manifestation of the hashjoin
batch explosion issue we have. The plan has a hash join, and the fact
that it runs fine with a bit more memory would be consistent with that too.

The hashjoin batch explosion happens when we find a batch that's too
large to fit into work_mem, and increasing the number of batches does
not really make it smaller (e.g. because there are a lot of rows with
exactly the same key). We end up doubling the number of batches, but
each batch needs an 8kB file buffer, so it's not hard to consume a lot of
memory due to this. Chances are the DSA allocation fails simply because
the system hits the overcommit limit, or something like that.
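
Just as a rough back-of-the-envelope illustration (assuming most of the
per-batch temp files hold an 8kB in-memory buffer at the same time): the 1M+
temporary files reported above would mean on the order of
1,000,000 x 8kB ~ 8GB of buffer space, which is easily enough to run into
the overcommit limit.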

It's a bit weird it needs 1.8GB of memory, but perhaps that's also
linked to the number of batches, somehow?

Anyway, if you could set a breakpoint on the error and see how many
batches the hash join has, that'd be helpful. I'd probably try doing
that with a non-parallel query; it makes it easier to debug, and it may
even report the number of batches if it completes.
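
Something along these lines should work (just a sketch; whether the variables
are actually printable depends on the debug symbols in your build):

$ gdb -p <pid of the backend or parallel worker>
(gdb) break errfinish
(gdb) continue
... run the query and wait for the breakpoint to fire ...
(gdb) bt
... pick the frame number of a hash join function that has "hashtable" in scope ...
(gdb) frame <n>
(gdb) print hashtable->nbatch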


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On 21/2/2024 19:52, Tomas Vondra wrote:
> Hi,
> 
> On 2/20/24 15:05, Alexey Ermakov wrote:
>> On 2024-02-16 20:40, Andrei Lepikhov wrote:
>>> Interesting. It correlates with one performance issue I have been
>>> trying to catch already 3 months. Could you provide some reproduction
>>> of that behavior?
>>>
>> Yes, I'm still trying to make reproducer, it will take some time. Thanks.
>>
> It's a bit weird it needs 1.8GB of memory, but perhaps that's also
> linked to the number of batches, somehow?
I found one possible weak point in the PHJ code, in
ExecParallelHashJoinSetUpBatches:

pstate->batches = dsa_allocate0(hashtable->area,
    EstimateParallelHashJoinBatch(hashtable) * nbatch);

It could explain why we have such a huge memory allocation, with a size 
not bound to a power of 2.
It also fits with the fact that we already have a second report [1] with 
the same DSA allocation size.
So, can Alexey apply a trivial patch and check whether the bug is gone?

[1] 

https://www.postgresql.org/message-id/flat/PAYP264MB3368AF4D63704E370B204C39FF582%40PAYP264MB3368.FRAP264.PROD.OUTLOOK.COM
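
Just to illustrate what I mean by a change at that line, a rough, untested
sketch could look like the one below (illustration only, not an actual patch;
the cast and the error message are mine, and a proper fix would rather stop
doubling nbatch long before the array gets anywhere near MaxAllocSize):

    /* illustration only: make the failure explicit and report nbatch */
    size_t		batches_size =
        (size_t) EstimateParallelHashJoinBatch(hashtable) * nbatch;

    if (batches_size > MaxAllocSize)	/* MaxAllocSize from utils/memutils.h */
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("cannot allocate parallel hash join batch array for %d batches",
                        nbatch)));

    pstate->batches = dsa_allocate0(hashtable->area, batches_size);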

-- 
regards,
Andrei Lepikhov
Postgres Professional




Hello! I reproduced the generation of many temporary files and abnormal memory
usage (in 'top' output both the backend and the parallel worker used around
3-4GB of RAM each; I often got the OOM killer), but not the memory allocation
error so far.

On the machine where I first got the error, I always got the same allocation
request size despite using different tables and conditions. Interestingly, the
same number appears in the other bug report.

Here is what I used:
PostgreSQL 16.2 (Ubuntu 16.2-1.pgdg20.04+1), a virtual machine with 8GB of RAM,
6GB of swap, and around 15-20GB of free space for temp files.

Almost all settings in postgresql.conf are default:
=========================================================================
shared_buffers = 128MB
effective_cache_size = 4GB
work_mem = 1MB
hash_mem_multiplier = 2
max_parallel_workers_per_gather = 1
jit = off
random_page_cost = 4
seq_page_cost = 1

test data:
=========================================================================
--test_1 table, 3.5M rows, 1..3M, n_distinct -0.8
create table test_1(a bigint);
insert into test_1 select id from generate_series(1,3000000) gs(id);
insert into test_1 select id from generate_series(1,500000) gs(id);
analyze test_1;

--test_2 table, 2 identical columns, 10M rows, 1..1M, n_distinct 933k
create table test_2(a bigint, b bigint);
insert into test_2 select gs.id, gs.id from generate_series(1,1000000) gs(id), generate_series(1,10) gs2(id);
create index on test_2(b);
analyze test_2;

--test_3 table, 12.58M rows, 1..500k, 80k rows with same value, n_distinct 490k
create table test_3(a bigint);
insert into test_3 select gs.id from generate_series(1,500000) gs(id), generate_series(1,25) gs2(id);
insert into test_3 select 999 from generate_series(1,80000);
create index on test_3(a);
analyze test_3;


=========================================================================
set work_mem = '2MB';
explain (analyze, buffers) select test_2.a
from test_2 left join test_3 on test_3.a = test_2.a and (test_3.a + 0) = test_3.a
left join test_1 on test_1.a = test_2.a
where test_2.b < 21500;

~224k temp files generated

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------
  Gather  (cost=280156.31..385334.66 rows=308388 width=8) (actual 
time=16342.410..35577.099 rows=12349500 loops=1)
    Workers Planned: 1
    Workers Launched: 1
    Buffers: shared hit=1077677 read=14901, temp read=245247 written=912164
    ->  Parallel Hash Right Join  (cost=279156.31..353495.86 rows=181405 
width=8) (actual time=16340.950..29837.671 rows=6174750 loops=2)
          Hash Cond: (test_1.a = test_2.a)
          Buffers: shared hit=1077677 read=14901, temp read=245247 
written=912164
          ->  Parallel Seq Scan on test_1  (cost=0.00..36075.24 
rows=2058824 width=8) (actual time=0.142..169.721 rows=1750000 loops=2)
                Buffers: shared hit=586 read=14901
          ->  Parallel Hash  (cost=276962.04..276962.04 rows=133701 
width=8) (actual time=4391.620..4391.624 rows=3087375 loops=2)
                Buckets: 131072 (originally 131072)  Batches: 65536 
(originally 4)  Memory Usage: 32352kB
                Buffers: shared hit=1077043, temp read=72723 written=387876
                ->  Nested Loop Left Join (cost=2814.38..276962.04 
rows=133701 width=8) (actual time=2.884..567.014 rows=3087375 loops=2)
                      Buffers: shared hit=1077043
                      ->  Parallel Bitmap Heap Scan on test_2 
(cost=2813.95..108618.25 rows=133701 width=8) (actual time=2.865..11.149 
rows=107495 loops=2)
                            Recheck Cond: (b < 21500)
                            Heap Blocks: exact=652
                            Buffers: shared hit=1422
                            ->  Bitmap Index Scan on test_2_b_idx  
(cost=0.00..2757.12 rows=227292 width=0) (actual time=3.901..3.901 
rows=214990 loops=1)
                                  Index Cond: (b < 21500)
                                  Buffers: shared hit=250
                      ->  Index Only Scan using test_3_a_idx on test_3  
(cost=0.43..1.25 rows=1 width=8) (actual time=0.001..0.003 rows=29 
loops=214990)
                            Index Cond: (a = test_2.a)
                            Filter: ((a + 0) = a)
                            Heap Fetches: 0
                            Buffers: shared hit=1075621
  Planning:
    Buffers: shared hit=8
  Planning Time: 0.576 ms
  Execution Time: 35907.097 ms

Unfortunately, that plan is not very stable with different parameters and on
different versions of PostgreSQL (I wasn't able to get it on a machine with
enough resources to test quickly). Sometimes the planner wants to hash join
with the test_3 table instead of test_1, or to join with test_1 but on the
inner side. With other plans it doesn't reproduce.

With work_mem = 4MB we get 141k files; with work_mem = 1MB we get 66k files.
More files = more memory consumed = longer execution time.
With more than 700-1000k files the system became unresponsive.

In that test I got:
 >Batches: 65536 (originally 4)

With prod data, the largest number of batches I've seen was:
 >Buckets: 131072 (originally 131072)  Batches: 262144 (originally 16)  Memory Usage: 4736kB

Perhaps that helps somehow.

 >So, can Alexey apply a trivial patch and check whether the bug has gone?
Yes, I think it's possible to test with prod data, but I don't fully
understand how you would like to change the given line.

Thanks,
Alexey Ermakov


On 2024-02-29 09:36, Andrei Lepikhov wrote:
> I found one possible weak point in the code of PHJ:
> ExecParallelHashJoinSetUpBatches:
>
> pstate->batches = dsa_allocate0(hashtable->area,
>     EstimateParallelHashJoinBatch(hashtable) * nbatch);
>
> It could explain why we have such a huge memory allocation with a size 
> not bonded to a power of 2.
> Also, it makes more sense that we already have a second report [1] 
> with the same value of DSA allocation.
> So, can Alexey apply a trivial patch and check whether the bug has gone?
>
> [1] 
>
https://www.postgresql.org/message-id/flat/PAYP264MB3368AF4D63704E370B204C39FF582%40PAYP264MB3368.FRAP264.PROD.OUTLOOK.COM
>




On 1/3/2024 16:50, Alexey Ermakov wrote:
> Hello! I reproduced generation of multiple temporary files and abnormal 
> memory
> usage (in 'top' output both backend and parallel worker used around 
> 3-4GB of
> RAM each, I often got OOM killer) but not memory allocation error so far.
Could you provide a backtrace at the moment of the error? It would show us 
the exact place of the error without guessing.

-- 
regards,
Andrei Lepikhov
Postgres Professional




On 2024-03-01 18:48, Andrei Lepikhov wrote:
> On 1/3/2024 16:50, Alexey Ermakov wrote:
>> Hello! I reproduced generation of multiple temporary files and 
>> abnormal memory
>> usage (in 'top' output both backend and parallel worker used around 
>> 3-4GB of
>> RAM each, I often got OOM killer) but not memory allocation error so 
>> far.
> Could you provide a backtrace at the moment of the error? It can show 
> us exact place of the error without guesses.
>
Yes, I will do it next week (it could take some time to get approvals 
to install gdb/debug packages or to set up another host).

--

Alexey Ermakov




On Thu, Feb 29, 2024 at 4:37 PM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> On 21/2/2024 19:52, Tomas Vondra wrote:
> > It's a bit weird it needs 1.8GB of memory, but perhaps that's also
> > linked to the number of batches, somehow?

> I found one possible weak point in the code of PHJ:
> ExecParallelHashJoinSetUpBatches:
>
> pstate->batches = dsa_allocate0(hashtable->area,
>         EstimateParallelHashJoinBatch(hashtable) * nbatch);
>
> It could explain why we have such a huge memory allocation with a size
> not bonded to a power of 2.

Hmm, a couple of short term ideas:

1.  Maybe the planner should charge a high cost for exceeding some
soft limit on the expected number of batches; perhaps it could be
linked to the number of file descriptors we can open (something like
1000), because during the build phase we'll be opening and closing
random file descriptors like crazy due to vfd pressure, which is not
free; that should hopefully discourage the planner from reaching cases
like this, but of course only in cases the planner can predict.

2.  It sounds like we should clamp nbatches.  We don't want the
per-partition state to exceed MaxAllocSize, which I guess is what
happened here if the above-quoted line produced the error?  (The flag
that would allow "huge" allocations exceeding MaxAllocSize seems
likely to make the world worse while we have so many footguns -- I
think once a query has reached this stage of terrible execution, it's
better to limit the damage.)

The performance and memory usage will still be terrible.  We just
don't deal well with huge numbers of tiny partitions.  Tiny relative
to input size, with input size being effectively unbounded.

Thoughts for later:  A lower limit would of course be possible and
likely desirable.  Once the partition-bookkeeping memory exceeds
work_mem * hash_mem_multiplier, it becomes stupid to double it just
because a hash table has hit that size, because that actually
increases total memory usage (and fast, because it's quadratic).  We'd
be better off in practice giving up on the hash table size limit and
hoping for the best.  But the real long term question is what strategy
we're going to use to actually deal with this situation properly
*without* giving up our memory usage policies and hoping for the best,
and that remains an open question.  To summarise the two main ideas
put forward so far: (1) allow very high number of batches, but process
at most N of M batches at a time, using a temporary "all-the-rest"
batch to be re-partitioned to feed the next N batches + the rest in a
later cycle, (2) fall back to looping over batches multiple times in
order to keep nbatches <= a small limit while also not exceeding a
hash table size limit.  Both have some tricky edge cases, especially
with parallelism in the picture but probably even without it.  I'm
willing to work more on exploring this some time after the 17 cycle.




On 3/3/24 23:12, Thomas Munro wrote:
> On Thu, Feb 29, 2024 at 4:37 PM Andrei Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>> On 21/2/2024 19:52, Tomas Vondra wrote:
>>> It's a bit weird it needs 1.8GB of memory, but perhaps that's also
>>> linked to the number of batches, somehow?
> 
>> I found one possible weak point in the code of PHJ:
>> ExecParallelHashJoinSetUpBatches:
>>
>> pstate->batches = dsa_allocate0(hashtable->area,
>>         EstimateParallelHashJoinBatch(hashtable) * nbatch);
>>
>> It could explain why we have such a huge memory allocation with a size
>> not bonded to a power of 2.
> 

Maybe, but how many batches would that require for 1.8GB of memory? From
what I see in my experiments, this uses ~432B per batch, so that'd be
about 4M batches. That's ... a lot. There may be some dependency on the
number of workers, but it's going to be in this ballpark.

OTOH I don't see any other dsa_allocate calls for the reproducer, so
maybe it really is this. I really wonder how many batches there are and
what the memory per batch is.
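
(For what it's worth, 432 bytes * 4,194,304 batches = 1,811,939,328 bytes
exactly, so the failed request size would correspond to nbatch = 2^22 if the
per-batch state here really is 432 bytes.)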

> Hmm, a couple of short term ideas:
> 
> 1.  Maybe the planner should charge a high cost for exceeding some
> soft limit on the expected number of batches; perhaps it could be
> linked to the number of file descriptors we can open (something like
> 1000), because during the build phase we'll be opening and closing
> random file descriptors like crazy due to vfd pressure, which is not
> free; that should hopefully discourage the planner from reaching cases
> like this, but of course only in cases the planner can predict.
> 

Will we really open file descriptors like crazy? I think the 8kB
buffers we keep for temporary files are pretty good protection against
that (but they're also one of the reasons for the memory explosion in the
serial hash join). Or does the parallel version use files differently?

> 2.  It sounds like we should clamp nbatches.  We don't want the
> per-partition state to exceed MaxAllocSize, which I guess is what
> happened here if the above-quoted line produced the error?  (The flag
> that would allow "huge" allocations exceeding MaxAllocSize seems
> likely to make the world worse while we have so many footguns -- I
> think once a query has reached this stage of terrible execution, it's
> better to limit the damage.)
> 
> The performance and memory usage will still be terrible.  We just
> don't deal well with huge numbers of tiny partitions.  Tiny relative
> to input size, with input size being effectively unbounded.
> 

Yeah.

> Thoughts for later:  A lower limit would of course be possible and
> likely desirable.  Once the partition-bookkeeping memory exceeds
> work_mem * hash_mem_multiplier, it becomes stupid to double it just
> because a hash table has hit that size, because that actually
> increases total memory usage (and fast, because it's quadratic).

Right. This is pretty much exactly the reason for the memory explosion
in serial hash join, with a report maybe once or twice a year (who knows
how many people don't report, though).

> We'd be better off in practice giving up on the hash table size limit
> and hoping for the best.
Perhaps it'd be better not to give up entirely, but to relax it a bit.
One of the proposals in [1] (from ~5 years ago) was to "balance" the
hash table size and the memory needed for the temporary files. That is, when
things go wrong, double the hash table size every time the temp
files need the same amount of memory. That minimizes the total amount of
memory needed by the join, I think.
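
A concrete example of what I mean (numbers made up for illustration): with an
8MB hash table limit, at 1024 batches the per-batch 8kB buffers already need
1024 x 8kB = 8MB, i.e. as much as the hash table itself. Doubling to 2048
batches pushes the buffer memory to 16MB while the hash table stays capped at
8MB, so past that crossover point it is cheaper, in total memory, to let the
hash table grow to 16MB instead.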

Back then it seemed like an ad hoc band-aid, but maybe it's better than
nothing. It was far simpler than various other ideas, and these failures
should be quite rare.

Perhaps a similar emergency approach would work here too?

> But the real long term question is what strategy
> we're going to use to actually deal with this situation properly
> *without* giving up our memory usage policies and hoping for the best,
> and that remains an open question.  To summarise the two main ideas
> put forward so far: (1) allow very high number of batches, but process
> at most N of M batches at a time, using a temporary "all-the-rest"
> batch to be re-partitioned to feed the next N batches + the rest in a
> later cycle, (2) fall back to looping over batches multiple times in
> order to keep nbatches <= a small limit while also not exceeding a
> hash table size limit.

I'm not sure about (2), but (1) sounds very much like a recursive hash
join, where we'd allow only a limited fan out at each stage. It's also a
bit like the "spill" files in my other proposal - that worked fine, and
it enforced the memory limit better than the memory balancing (which is
more like best-effort heuristics).

> Both have some tricky edge cases, especially
> with parallelism in the picture but probably even without it.  I'm
> willing to work more on exploring this some time after the 17 cycle.

I haven't thought very much about parallel hash joins, but what would be
the corner cases for non-parallel cases? Sure, it may increase the
amount of I/O, but only for cases with unexpectedly many batches (and
then it's just a natural trade-off I/O vs. enforcing memory limit).

regards

[1]
https://www.postgresql.org/message-id/20190504003414.bulcbnge3rhwhcsh%40development

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Mon, Mar 4, 2024 at 12:49 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 3/3/24 23:12, Thomas Munro wrote:
> > But the real long term question is what strategy
> > we're going to use to actually deal with this situation properly
> > *without* giving up our memory usage policies and hoping for the best,
> > and that remains an open question.  To summarise the two main ideas
> > put forward so far: (1) allow very high number of batches, but process
> > at most N of M batches at a time, using a temporary "all-the-rest"
> > batch to be re-partitioned to feed the next N batches + the rest in a
> > later cycle, (2) fall back to looping over batches multiple times in
> > order to keep nbatches <= a small limit while also not exceeding a
> > hash table size limit.
>
> I'm not sure about (2), but (1) sounds very much like a recursive hash
> join, where we'd allow only a limited fan out at each stage. It's also a
> bit like the "spill" files in my other proposal - that worked fine, and
> it enforced the memory limit better than the memory balancing (which is
> more like best-effort heuristics).

Yeah, I was trying to describe your spill idea in a few words.  It's a good idea.

> > Both have some tricky edge cases, especially
> > with parallelism in the picture but probably even without it.  I'm
> > willing to work more on exploring this some time after the 17 cycle.
>
> I haven't thought very much about parallel hash joins, but what would be
> the corner cases for non-parallel cases? Sure, it may increase the
> amount of I/O, but only for cases with unexpectedly many batches (and
> then it's just a natural trade-off I/O vs. enforcing memory limit).

For parallel-specific problems I'd have to swap that back into my
brain... but I do remember one general problem with idea #1: if you
have too many duplicate keys, then anything based only on partition by
hash bits won't help, which was one of my objections to that idea and
the reason why I was talking about idea #2, which doesn't suffer from
that problem (but has other problems).  This morning I've been
wondering about a highly targeted solution to that: could we recognise
this condition cheaply, and then activate a special case that spills a
*bucket* to disk?  That is, a problem bucket that is indivisible by
hashing because all the hash values are the same.  Or something like
that.



On Mon, Mar 4, 2024 at 1:11 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ...  This morning I've been
> wondering about a highly targeted solution to that: could we recognise
> this condition cheaply, and then activate a special case that spills a
> *bucket* to disk?  That is, a problem bucket that is indivisible by
> hashing because all the hash values are the same.  Or something like
> that.

(One of the main problems being how to manage the match bits for outer
joins, a big problem also for idea #2.  But maybe it's somehow more
tractable there, I dunno.  I'd need vastly more coffee to have any
idea...)



On 2024-03-01 18:48, Andrei Lepikhov wrote:
> Could you provide a backtrace at the moment of the error? It can show 
> us exact place of the error without guesses.

Hello. Yes, you are right about ExecParallelHashJoinSetUpBatches. 
The backtrace looks like this:

========parallel worker
Breakpoint 1, 0x0000000000a2f1e0 in errfinish ()
(gdb) bt
#0  0x0000000000a2f1e0 in errfinish ()
#1  0x0000000000515d06 in dsa_allocate_extended.cold ()
#2  0x000000000071abdc in ExecParallelHashJoinSetUpBatches ()
#3  0x000000000071bd03 in ExecParallelHashIncreaseNumBatches ()
#4  0x000000000071c50d in ExecParallelHashTableInsert ()
#5  0x000000000071c6fa in MultiExecHash ()
#6  0x000000000071ecd1 in ExecParallelHashJoin.lto_priv.0 ()
#7  0x00000000006f8dc2 in standard_ExecutorRun ()
#8  0x00007f860667cf75 in pgss_ExecutorRun () from 
/usr/pgsql-16/lib/pg_stat_statements.so
#9  0x00007f8606271cad in pgsk_ExecutorRun () from 
/usr/pgsql-16/lib/pg_stat_kcache.so
#10 0x00000000006fdd36 in ParallelQueryMain ()
#11 0x00000000005b1ca2 in ParallelWorkerMain ()
#12 0x000000000083072c in StartBackgroundWorker ()
#13 0x000000000083a177 in maybe_start_bgworkers.lto_priv ()
#14 0x000000000083b7f4 in ServerLoop.lto_priv.0 ()
#15 0x0000000000833b6d in PostmasterMain ()
#16 0x000000000051c221 in main ()

=======backend
Breakpoint 1, 0x0000000000a2f1e0 in errfinish ()
(gdb) bt
#0  0x0000000000a2f1e0 in errfinish ()
#1  0x00000000005b12e8 in HandleParallelMessages ()
#2  0x00000000008c74f5 in ProcessInterrupts.part.0 ()
#3  0x00000000008b0715 in ConditionVariableTimedSleep.part.0 ()
#4  0x000000000089ea85 in BarrierArriveAndWait ()
#5  0x000000000071b9e7 in ExecParallelHashIncreaseNumBatches ()
#6  0x000000000071c255 in ExecParallelHashTupleAlloc ()
#7  0x000000000071c449 in ExecParallelHashTableInsert ()
#8  0x000000000071c6fa in MultiExecHash ()
#9  0x000000000071ecd1 in ExecParallelHashJoin.lto_priv.0 ()
#10 0x0000000000718124 in ExecGather ()
#11 0x00000000006f8dc2 in standard_ExecutorRun ()
#12 0x00007f860667cf75 in pgss_ExecutorRun () from 
/usr/pgsql-16/lib/pg_stat_statements.so
#13 0x00007f8606271cad in pgsk_ExecutorRun () from 
/usr/pgsql-16/lib/pg_stat_kcache.so
#14 0x00000000008c8ba6 in PortalRunSelect ()
#15 0x00000000008ca556 in PortalRun ()
#16 0x00000000008cb5e0 in exec_simple_query ()
#17 0x00000000008cd96f in PostgresMain ()
#18 0x000000000083d174 in ServerLoop.lto_priv.0 ()
#19 0x0000000000833b6d in PostmasterMain ()
#20 0x000000000051c221 in main ()

Would you like any additional info from gdb (like the number of 
batches)?

--

Thanks,

Alexey Ermakov




On 11/3/2024 15:12, Alexey Ermakov wrote:
> Would you like to get any additional info from gdb (like number of 
> batches) ?

Can you try to reproduce the case with the simple patch attached?

-- 
regards,
Andrei Lepikhov
Postgres Professional

Attachments