Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

Поиск
Список
Период
Сортировка
От James Coleman
Тема Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Дата
Msg-id CAAaqYe8XAVk408rowxxUpSNMVXx=fpK-rRKGzwkuHBaCcv3jRQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Ответы Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash  (Thomas Munro <thomas.munro@gmail.com>)
Список pgsql-bugs
On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:
> >The following bug has been logged on the website:
> >
> >Bug reference:      16104
> >Logged by:          James Coleman
> >Email address:      jtc331@gmail.com
> >PostgreSQL version: 11.5
> >Operating system:   Debian
> >Description:
> >
> >We have a query that, after a recent logical migration to 11.5, ends up with
> >a parallel hash join (I don't think the query plan/query itself is important
> >here, but if needed after the rest of the explanation, I can try to redact
> >it for posting). The query results in this error:
> >
> >ERROR:  invalid DSA memory alloc request size 1375731712
> >
> >(the size changes sometimes significantly, but always over a GB)
> >
> >At first glance it sounded eerily similar to this report which preceded the
> >final release of 11.0:
>
>https://www.postgresql.org/message-id/flat/CAEepm%3D1x48j0P5gwDUXyo6c9xRx0t_57UjVaz6X98fEyN-mQ4A%40mail.gmail.com#465f3a61bea2719bc4a7102541326dde
> >but I confirmed that the patch for that bug was applied and is in 11.5 (and
> >earlier).
> >
> >We managed to reproduce this on a replica, and so were able to attach gdb in
> >production to capture a backtrace:
> >
> >#0  errfinish (dummy=dummy@entry=0) at
> >./build/../src/backend/utils/error/elog.c:423
> >#1  0x000055a7c0a00f79 in elog_finish (elevel=elevel@entry=20,
> >fmt=fmt@entry=0x55a7c0babc18 "invalid DSA memory alloc request size %zu") at
> >./build/../src/backend/utils/error/elog.c:1385
> >#2  0x000055a7c0a2308b in dsa_allocate_extended (area=0x55a7c1d6aa38,
> >size=1140850688, flags=flags@entry=4) at
> >./build/../src/backend/utils/mmgr/dsa.c:677
> >#3  0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
> >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at
> >./build/../src/backend/executor/nodeHash.c:2889
> > ...
>
> I've briefly looked at this today, and I think the root cause is
> somewhat similar to what is described in [1] where we simply increase
> the number of batches in an effort to keep batch contents in work_mem,
> but ignoring that each batch requires quite a bit of memory. So we end
> up with a lot of batches where each is small enough to fit into
> work_mem, but we need much more than work_mem to track the batches.
>
> This seems to be about the same problem, except that instead of
> forgeting about BufFile, the parallel hash join ignores this:
>
>    pstate->batches =
>      dsa_allocate0(hashtable->area,
>                    EstimateParallelHashJoinBatch(hashtable) * nbatch);
>
> Looking at the backtrace, you ended up with nbatch=2097152 and it failed
> trying to allocate 1140850688 bytes, i.e. exactly 544B per batch.
>
> The EstimateParallelHashJoinBatch combines a bunch of structs and does
> depend on the number of participants, and by my rough calculation 544B
> means ~13 participants. Any idea if that matches the number of parallel
> workers for this query?

We currently have max_parallel_workers_per_gather = 6, which is where
the repro happened with the backtrace. We initially had the problem
with max_parallel_workers_per_gather = 2. I also just confirmed an
explain shows Workers Planned: 6.

> Can you try running the query with fewer workers per gather?

Since it failed initially with 2, I think we've effectively tested
this already (I assume that's a significant enough change to be
expected to solve it).

> IMHO this has the same issue as the non-parallel case described in [1]
> in that we don't really consider this memory when planning the query,
> which means we may accept hash join for queries where we know the number
> of batches will be too high during planning. We need to reject parallel
> hash join in those cases.
>
> Of course, I don't know if this is such case - perhaps this is a case of
> underestimation and we only learn the actual number while executing the
> query, at which point it's too late to reject PHJ. I suppose the right
> solution in that case would be some form of BNL, as discussed in [2] but
> I'm not sure how far are we from having that, or if that works for PHJ.
>
>
> regards
>
> [1] https://www.postgresql.org/message-id/flat/20190504003414.bulcbnge3rhwhcsh@development
>
> [2] https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com
>
> --
> Tomas Vondra                  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



В списке pgsql-bugs по дате отправления:

Предыдущее
От: Tomas Vondra
Дата:
Сообщение: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Следующее
От: Thomas Munro
Дата:
Сообщение: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash