Re: BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

From: Tomas Vondra
Subject: Re: BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Date:
Msg-id: 556ea57a-9cf3-4288-936e-bbe63f2bd29e@enterprisedb.com
In reply to: Re: BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker  (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-bugs

On 3/3/24 23:12, Thomas Munro wrote:
> On Thu, Feb 29, 2024 at 4:37 PM Andrei Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>> On 21/2/2024 19:52, Tomas Vondra wrote:
>>> It's a bit weird it needs 1.8GB of memory, but perhaps that's also
>>> linked to the number of batches, somehow?
> 
>> I found one possible weak point in the code of PHJ:
>> ExecParallelHashJoinSetUpBatches:
>>
>> pstate->batches = dsa_allocate0(hashtable->area,
>>         EstimateParallelHashJoinBatch(hashtable) * nbatch);
>>
>> It could explain why we have such a huge memory allocation with a size
>> not bonded to a power of 2.
> 

Maybe, but how many batches would that require for 1.8GB of memory? From
what I see in my experiments, this uses ~432B per batch, so that'd be
about 4M batches. That's ... a lot. There may be some dependency on the
number of workers, but it's going to be in this ballpark.

OTOH I don't see any other dsa_allocate calls for the reproducer, so
maybe it really is this. I really wonder how many batches there are and
what the memory per batch is.
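
Just to spell out the arithmetic (the ~432B/batch figure is from my
experiments, so treat it as approximate):

    1811939328 / 432 = 4194304 = 2^22

so the failed allocation corresponds almost exactly to 4M batches, which
is at least consistent with nbatch being a power of two.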

> Hmm, a couple of short term ideas:
> 
> 1.  Maybe the planner should charge a high cost for exceeding some
> soft limit on the expected number of batches; perhaps it could be
> linked to the number of file descriptors we can open (something like
> 1000), because during the build phase we'll be opening and closing
> random file descriptors like crazy due to vfd pressure, which is not
> free; that should hopefully discourage the planner from reaching cases
> like this, but of course only in cases the planner can predict.
> 

Will we really open file descriptors like crazy? I think the 8kB buffers
we keep for temporary files are pretty good protection against that (but
they're also one of the reasons for the memory explosion in the serial
hash join). Or does the parallel version use the files differently?

> 2.  It sounds like we should clamp nbatches.  We don't want the
> per-partition state to exceed MaxAllocSize, which I guess is what
> happened here if the above-quoted line produced the error?  (The flag
> that would allow "huge" allocations exceeding MaxAllocSize seems
> likely to make the world worse while we have so many footguns -- I
> think once a query has reached this stage of terrible execution, it's
> better to limit the damage.)
> 
> The performance and memory usage will still be terrible.  We just
> don't deal well with huge numbers of tiny partitions.  Tiny relative
> to input size, with input size being effectively unbounded.
> 

Yeah.
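
If we do clamp nbatch, I imagine something along these lines in
ExecParallelHashJoinSetUpBatches (a rough sketch only, not actual code,
ignoring how the clamp would interact with the rest of the batching
logic):

    /* hypothetical clamp, keeping the shared batch array under MaxAllocSize */
    size_t      batch_size = EstimateParallelHashJoinBatch(hashtable);
    int         max_nbatch = MaxAllocSize / batch_size;

    if (nbatch > max_nbatch)
        nbatch = max_nbatch;    /* accept a larger hash table instead */

    pstate->batches = dsa_allocate0(hashtable->area,
                                    batch_size * nbatch);

That would avoid the allocation error, even if the performance stays
terrible, as you say.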

> Thoughts for later:  A lower limit would of course be possible and
> likely desirable.  Once the partition-bookkeeping memory exceeds
> work_mem * hash_mem_multiplier, it becomes stupid to double it just
> because a hash table has hit that size, because that actually
> increases total memory usage (and fast, because it's quadratic).

Right. This is pretty much exactly the reason for the memory explosion
in serial hash join, with a report maybe once or twice a year (who knows
how many people don't report, though).
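
To put some numbers on that: with at least one 8kB BufFile buffer per
batch (in practice more, since each side gets its own file), and a 4MB
work_mem purely as an example, the batch buffers alone grow like this:

    nbatch        batch buffers     vs. 4MB work_mem
       512              4 MB        at the limit
      2048             16 MB        4x
      8192             64 MB        16x
       ...
    4194304            32 GB        8192x

So once the buffers alone reach the work_mem budget, each further
doubling of nbatch doubles the very memory the limit was supposed to
protect.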

> We'd be better off in practice giving up on the hash table size limit
> and hoping for the best.

Perhaps it'd be better not to give up entirely, but to relax the limit a
bit. One of the proposals in [1] (from ~5 years ago) was to "balance" the
hash table size against the memory needed for the temporary files. That
is, when things go wrong, double the hash table size whenever the temp
files need about the same amount of memory. That minimizes the total
amount of memory needed by the join, I think.
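
In pseudo-code, the decision at the point where the in-memory hash table
exceeds its limit would be roughly this (names made up, simplified):

    /* the hash table has just exceeded space_allowed */
    if ((size_t) nbatch * 2 * BLCKSZ >= space_allowed)
        space_allowed *= 2;     /* temp-file buffers already need as much
                                 * memory, so doubling nbatch would only
                                 * increase the total */
    else
        nbatch *= 2;            /* the usual response: split batches again */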

Back then it seemed like an ad hoc band-aid, but maybe it's better than
nothing. It was far simpler than various other ideas, and these failures
should be quite rare.

Perhaps a similar emergency approach would work here too?

> But the real long term question is what strategy
> we're going to use to actually deal with this situation properly
> *without* giving up our memory usage policies and hoping for the best,
> and that remains an open question.  To summarise the two main ideas
> put forward so far: (1) allow very high number of batches, but process
> at most N of M batches at a time, using a temporary "all-the-rest"
> batch to be re-partitioned to feed the next N batches + the rest in a
> later cycle, (2) fall back to looping over batches multiple times in
> order to keep nbatches <= a small limit while also not exceeding a
> hash table size limit.

I'm not sure about (2), but (1) sounds very much like a recursive hash
join, where we'd allow only a limited fan out at each stage. It's also a
bit like the "spill" files in my other proposal - that worked fine, and
it enforced the memory limit better than the memory balancing (which is
more like best-effort heuristics).

> Both have some tricky edge cases, especially
> with parallelism in the picture but probably even without it.  I'm
> willing to work more on exploring this some time after the 17 cycle.

I haven't thought very much about parallel hash joins, but what would the
tricky edge cases be in the non-parallel case? Sure, it may increase the
amount of I/O, but only in cases with unexpectedly many batches (and then
it's just a natural trade-off between I/O and enforcing the memory limit).

regards

[1]
https://www.postgresql.org/message-id/20190504003414.bulcbnge3rhwhcsh%40development

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


