Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

From: James Coleman
Subject: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Date:
Msg-id: CAAaqYe-VHx2O1k1x-wML3_1GHCPajugvMify=a_4cq-aHb=ZVg@mail.gmail.com
In reply to: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash  (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-bugs
On Sat, Nov 9, 2019 at 4:54 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sun, Nov 10, 2019 at 9:05 AM James Coleman <jtc331@gmail.com> wrote:
> > In this case though the failure in some sense seems fairly artificial.
> > Aside from the query being slow, there doesn't appear to be any real
> > limitation on the query completing. The box in question has 768GB of
> > memory, so limiting this memory structure to 1GB seems artificial.
>
> Yeah, completely artificial.  We limit ourselves to MaxAllocSize, a
> self-imposed limit that applies to palloc() and dsa_allocate(),
> probably dating back to a time in history when anyone asking for more
> than that must surely be out of their mind.  Then there are a few
> places where we allow ourselves to go past that limit by passing in an
> extra flag MCXT_ALLOC_HUGE or DSA_ALLOC_HUGE: from a quick grep,
> that's the bitmap heap scan page table, and anything using
> simplehash.h's default allocator (so I guess that includes hash agg;
> that's interesting, we can do a 64GB-bucket-array hash agg but not
> hash join).
>
> As noted on https://wiki.postgresql.org/wiki/Hash_Join, there are a
> couple of artificial constraints on hash joins: the limit on the
> number of hash join buckets which comes entirely from using 'int' as a
> data type for bucket numbers (an anachronism from pre-standard C or
> maybe just 32 bit-ism: the correct type is surely size_t, which is by
> definition big enough for any array that you could possibly address),
> and the MaxAllocSize thing.  Easy to fix, of course.  I noticed that
> when working on PHJ and decided to keep the same restriction for the
> new parallel code paths, because (1) it seemed like a policy choice we
> should make and then apply to both, and (2) it does provide some kind
> of sanity check, though it's increasingly looking overdue to be
> removed (in other words: previously I was only speculating about
> people showing up with ~1TB RAM machines and running into these
> ancient limits, but ... here we are).

Heh. We've run some clusters with 2TB of memory for many years, but I
also realize that's not all that common.
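
(Back-of-the-envelope on those two limits, assuming I'm reading
ExecChooseHashTableSize right -- with 64-bit pointers the MaxAllocSize
cap on the bucket array binds long before the 'int' one does. This is
just illustrative; the real sizing also rounds to powers of two and
applies other clamps.)

#include <stdio.h>
#include <limits.h>
#include <stddef.h>

int main(void)
{
    size_t max_alloc = (size_t) 0x3fffffff;   /* MaxAllocSize, just under 1GB */
    size_t bucket_ptr = sizeof(void *);       /* one pointer per bucket */

    printf("bucket array cap: %zu / %zu = %zu buckets\n",
           max_alloc, bucket_ptr, max_alloc / bucket_ptr);
    printf("'int' cap on bucket numbers: %d\n", INT_MAX);
    return 0;
}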

> For example, suppose we added the DSA_ALLOC_HUGE flag to the line that
> is failing in your case.  Now it would be able to blow through that
> 1GB limit, but where would it stop?  Until we know how you're reaching
> this state, it's hard to know whether it'd go to (say) 2GB, and then
> work perfectly, job done, or whether it'd keep going until it ate all
> your RAM and made your users really unhappy.

Yeah, I think it's definitely non-obvious exactly what to do. In
general I think preferring not to fail queries is good, but on the
other hand that could result in the OOM killer taking out a bunch of
other things, and/or other queries failing because they can't allocate
memory.
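
(To make sure I understand what adding the flag would do, here's a toy
stand-alone model of the size check; the real logic lives in
dsa_allocate_extended() and the AllocSizeIsValid() checks, so the names
and numbers here are only illustrative.)

#include <stdbool.h>
#include <stdio.h>
#include <stddef.h>

#define TOY_MAX_ALLOC ((size_t) 0x3fffffff)  /* stand-in for MaxAllocSize */
#define TOY_ALLOC_HUGE 0x01                  /* stand-in for DSA_ALLOC_HUGE */

static bool request_ok(size_t size, int flags)
{
    if (flags & TOY_ALLOC_HUGE)
        return true;                 /* huge requests skip the ~1GB cap (the
                                      * real code still caps them, at roughly
                                      * half of SIZE_MAX) */
    return size <= TOY_MAX_ALLOC;    /* otherwise: "invalid memory alloc request" */
}

int main(void)
{
    size_t request = (size_t) 2 * 1024 * 1024 * 1024;   /* a 2GB request */

    printf("2GB without HUGE: %s\n",
           request_ok(request, 0) ? "allowed" : "rejected");
    printf("2GB with HUGE:    %s\n",
           request_ok(request, TOY_ALLOC_HUGE) ? "allowed" : "rejected");
    return 0;
}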

> I think this must be a case of extreme skew, as complained about
> in[1].  Let's see... you have ~6 billion rows, and you said the
> planner knew that (it estimated about a billion, when there were 6
> workers, so it's in the ball park).  You didn't say how many batches
> the planner planned for.  Let's see if I can guess... 256 or 512?

I ran a successful one (this run happened to have
max_parallel_workers_per_gather set to 8, and used 8 workers) at 150MB
work_mem, and the parallel hash node had these stats:

Buckets: 4194304 (originally 4194304)  Batches: 32768 (originally 4096)  Memory Usage: 344448kB
Buffers: shared hit=38360 read=142422702, temp read=45766294 written=76912272

I'm not sure how to capture the batch information on a failing run,
since you only get it with EXPLAIN ANALYZE, and the query doesn't
complete so you can't get that. I do notice in the original bug
report's stack trace, though, that:

#3  0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at
./build/../src/backend/executor/nodeHash.c:2889

which I think tells us the number of batches?
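
If so, here's a rough sanity check on why ~2M batches trips the 1GB
limit (the per-batch shared-state size below is just an assumed figure
for illustration; the point is that a single allocation tracking that
many batches clearly blows past MaxAllocSize):

#include <stdio.h>
#include <stddef.h>

int main(void)
{
    size_t nbatch = 2097152;                  /* from the stack trace above */
    size_t per_batch_bytes = 544;             /* assumed per-batch shared state */
    size_t max_alloc = (size_t) 0x3fffffff;   /* MaxAllocSize */
    size_t request = nbatch * per_batch_bytes;

    printf("%zu batches * %zu bytes = %zu bytes\n",
           nbatch, per_batch_bytes, request);
    printf("MaxAllocSize = %zu, so the request %s\n", max_alloc,
           request > max_alloc ? "exceeds the cap" : "fits");
    return 0;
}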

That temp "written" figure is ~615GB, which corresponds to one blip on
the disk graphs for this run (that blip was around 600GB), and the
HashAggregate node at the top of the query (which groups only on
items.account_id; see below, that column has significant skew) had:

Buffers: shared hit=608075044 read=142422702, temp read=77222547 written=77284432

which corresponds to the second disk graph blip of similar size.
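
(The block arithmetic behind those figures, for reference -- temp
buffers are 8 kB blocks:)

#include <stdio.h>

int main(void)
{
    /* temp "written" block counts from the two EXPLAIN fragments above */
    long long hash_temp_written = 76912272LL;  /* parallel hash node */
    long long agg_temp_written  = 77284432LL;  /* HashAggregate node */

    printf("hash join temp written: %lld blocks * 8 kB = ~%.0f GB\n",
           hash_temp_written, hash_temp_written * 8 / 1e6);
    printf("hash agg temp written:  %lld blocks * 8 kB = ~%.0f GB\n",
           agg_temp_written, agg_temp_written * 8 / 1e6);
    return 0;
}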

Side note about EXPLAIN output: the hash aggregate node doesn't appear
to output bucket stats. I've also noticed that nested loop joins don't
appear to output the join condition, while hash joins output the hash
condition and, if I remember correctly, merge joins output the merge
condition. If I'm right about that... I guess that's another thing for
the list of potential patches; it seems like it'd make understanding
plans easier.

> That'd allow for 6 billion * 16 byte rows + some slop, chopped up into
> a power-of-two number of partitions that fit inside 500MB.  Then at
> execution time, they didn't fit, and we went into
> repartition-until-it-fits mode.  At some point we tried to cut them
> into ~2 million partitions and hit this error.  That'd be a paltry
> 3,000 rows per partition if the keys were uniformly distributed, but
> it thinks that at least one partition is not fitting into 500MB.
> Conclusion: unless this is a carefully crafted hash attack, there must
> be one particular key that has more than 500MB worth of rows, but also a
> smattering of other keys that fall into the same partition, that are
> causing us to keep trying to repartition until it eventually squeezes
> all of them all in the same direction during a split (this requires
> repeatedly splitting partitions until you reach one partition per
> tuple!).  So I'm wondering if this would be fixed by, say, a 95%
> threshold (instead of 100%) for the extreme skew detector, as I
> proposed in a patch in the first email in that thread that later
> veered off into the BNL discussion[1].  Unfortunately that patch only
> deals with non-parallel HJ, but a PHJ version should be easy.  And if
> not by 95%, I wonder what threshold would be needed for this case, and
> what kind of maths to use to think about it.

So I should have run the earlier attached plan with VERBOSE, but
here's the interesting thing: the parallel hash node's seq scan node
outputs two columns: let's call them (from the redacted plan)
items.system_id and items.account_id. The first (system_id) is both
not null and unique; the second (account_id) definitely has massive
skew. I'm not very up-to-speed on how the hash building works, but I
would have (perhaps naïvely?) assumed that the first column being
unique would make the hash keys very likely not to collide in any
significantly skewed way. Am I missing something here?
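
To check my own mental model, here's a toy version of the combining as
I understand it (rotate the accumulated hash left one bit, then XOR in
the next column's hash, roughly what ExecHashGetHashValue does).
Assuming both columns really are part of the hashed join clause, the
unique column should indeed scatter the combined hashes; the hash
function and batch assignment below are stand-ins, not the real ones:

#include <stdio.h>
#include <stdint.h>

/* Toy stand-in for the per-column hash functions (the real ones are the
 * datatype hash opclass functions); this is just a 64-bit mixer. */
static uint32_t toy_hash(uint64_t v)
{
    v ^= v >> 30; v *= 0xbf58476d1ce4e5b9ULL;
    v ^= v >> 27; v *= 0x94d049bb133111ebULL;
    v ^= v >> 31;
    return (uint32_t) v;
}

/* Rotate-left-one-bit then XOR, which is roughly how I read the per-key
 * combining in ExecHashGetHashValue. */
static uint32_t combine(uint32_t acc, uint32_t hkey)
{
    acc = (acc << 1) | (acc >> 31);
    return acc ^ hkey;
}

int main(void)
{
    uint32_t nbatch = 32768;    /* batch assignment is plain modulo here,
                                 * just for illustration */

    /* every row shares the same (skewed) account_id but has a unique
     * system_id; if both columns are hashed, the combined values scatter */
    for (uint64_t system_id = 1; system_id <= 5; system_id++)
    {
        uint32_t h = 0;
        h = combine(h, toy_hash(system_id));   /* unique column */
        h = combine(h, toy_hash(42));          /* skewed column (account_id) */
        printf("system_id=%llu -> hash=%08x, batch=%u\n",
               (unsigned long long) system_id, h, h % nbatch);
    }
    return 0;
}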

> If I wrote a patch like
> [1] with PHJ support, would you be able to test it on a copy of your
> workload?

I think that would be a significant effort to pull off, but _maybe_
not entirely impossible. Certainly quite difficult, though.

> [1] https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com


