Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
From | Thomas Munro
Subject | Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Date |
Msg-id | CA+hUKGJFNg0XqXiBNzOeTPsCa4tb-a00Sv5EA7jD+w26ATyUyA@mail.gmail.com
In reply to | Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash (James Coleman <jtc331@gmail.com>)
Responses | Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash (James Coleman <jtc331@gmail.com>)
List | pgsql-bugs
On Sun, Nov 10, 2019 at 9:05 AM James Coleman <jtc331@gmail.com> wrote:
> In this case though the failure in some sense seems fairly artificial.
> Aside from the query being slow, there doesn't appear to be any real
> limitation on the query completing. The box in question has 768GB of
> memory, so limiting this memory structure to 1GB seems artificial.

Yeah, completely artificial. We limit ourselves to MaxAllocSize, a self-imposed limit that applies to palloc() and dsa_allocate(), probably dating back to a time in history when anyone asking for more than that must surely have been out of their mind. Then there are a few places where we allow ourselves to go past that limit by passing in an extra flag, MCXT_ALLOC_HUGE or DSA_ALLOC_HUGE: from a quick grep, that's the bitmap heap scan page table, and anything using simplehash.h's default allocator (so I guess that includes hash agg; that's interesting: we can do a 64GB-bucket-array hash agg, but not hash join).

As noted on https://wiki.postgresql.org/wiki/Hash_Join, there are a couple of artificial constraints on hash joins: the limit on the number of hash join buckets, which comes entirely from using 'int' as the data type for bucket numbers (an anachronism from pre-standard C, or maybe just 32-bit-ism: the correct type is surely size_t, which is by definition big enough for any array you could possibly address), and the MaxAllocSize thing. Easy to fix, of course. I noticed that when working on PHJ and decided to keep the same restriction for the new parallel code paths, because (1) it seemed like a policy choice we should make and then apply to both, and (2) it does provide some kind of sanity check, though it's increasingly looking overdue to be removed (in other words: previously I was only speculating about people showing up with ~1TB RAM machines and running into these ancient limits, but ... here we are).

For example, suppose we added the DSA_ALLOC_HUGE flag to the line that is failing in your case.
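(As an aside, the sizes involved in those two limits are easy to check with a back-of-the-envelope sketch; Python here rather than the C in the tree. The only inputs are MaxAllocSize's value from memutils.h and the assumption of 8-byte pointers on a 64-bit build.)

```python
# Back-of-the-envelope look at the two artificial hash join limits
# discussed above.  Not PostgreSQL source; 8-byte pointers assumed.
MAX_ALLOC_SIZE = 0x3FFFFFFF      # palloc()/dsa_allocate() cap, just under 1GB
POINTER_SIZE = 8                 # sizeof(HashJoinTuple) on a 64-bit build
INT_MAX = 2**31 - 1              # cap implied by using 'int' for bucket numbers

# Largest bucket array that fits under MaxAllocSize:
max_buckets_by_alloc = MAX_ALLOC_SIZE // POINTER_SIZE
print(max_buckets_by_alloc)              # ~134 million buckets

# The 1GB allocation cap bites long before the 'int' bucket-number cap:
print(max_buckets_by_alloc < INT_MAX)    # True
```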
Now it would be able to blow through that 1GB limit, but where would it stop? Until we know how you're reaching this state, it's hard to know whether it'd go to (say) 2GB and then work perfectly, job done, or whether it'd keep going until it ate all your RAM and made your users really unhappy.

I think this must be a case of extreme skew, as complained about in [1]. Let's see... you have ~6 billion rows, and you said the planner knew that (it estimated about a billion, when there were 6 workers, so it's in the ballpark). You didn't say how many batches the planner planned for. Let's see if I can guess... 256 or 512? That'd allow for 6 billion * 16-byte rows + some slop, chopped up into a power-of-two number of partitions that fit inside 500MB. Then at execution time they didn't fit, and we went into repartition-until-it-fits mode. At some point we tried to cut them into ~2 million partitions and hit this error. That'd be a paltry ~3,000 rows per partition if the keys were uniformly distributed, but it thinks that at least one partition still doesn't fit into 500MB.

Conclusion: unless this is a carefully crafted hash attack, there must be one particular key with more than 500MB worth of rows, plus a smattering of other keys that fall into the same partition, causing us to keep trying to repartition until a split eventually squeezes all of them in the same direction (this requires repeatedly splitting partitions until you reach one partition per tuple!).

So I'm wondering if this would be fixed by, say, a 95% threshold (instead of 100%) for the extreme skew detector, as I proposed in a patch in the first email of the thread that later veered off into the BNL discussion [1]. Unfortunately that patch only deals with non-parallel HJ, but a PHJ version should be easy. And if not 95%, I wonder what threshold would be needed for this case, and what kind of maths to use to think about it.
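For what it's worth, the guesswork above checks out arithmetically. A quick sketch, where the 6 billion rows, the 16-byte row width, and the ~500MB per-batch budget are all assumptions taken from this thread rather than measured values:

```python
# Reconstructing the batch-count guess: total tuple volume chopped into
# a power-of-two number of partitions that each fit the memory budget.
rows = 6_000_000_000
row_width = 16                  # bytes per row, per the email
budget = 500 * 1024 * 1024      # ~500MB of hash table memory

total = rows * row_width        # ~96GB of tuples
nbatch = 1
while total / nbatch > budget:
    nbatch *= 2                 # each repartitioning pass doubles nbatch
print(nbatch)                   # 256, matching the "256 or 512?" guess

# The failing state: repartitioned all the way to ~2 million batches.
skewed_nbatch = 2 ** 21
print(rows // skewed_nbatch)    # ~2,861 rows per batch if keys were uniform
```

With uniformly distributed keys each of those ~2 million partitions would hold only a few thousand rows, which is why a single >500MB key is the natural explanation for repartitioning continuing that far.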
If I wrote a patch like [1] with PHJ support, would you be able to test it on a copy of your workload?

[1] https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com