Re: Wrong hash table size calculation in Parallel Hash Join
From: Thomas Munro
Subject: Re: Wrong hash table size calculation in Parallel Hash Join
Date:
Msg-id: CA+hUKGKrVDRRkz9N96BvWE+mJhS-+vBTbQBvxVTLuMVH2YUBMw@mail.gmail.com
In reply to: Wrong hash table size calculation in Parallel Hash Join (Andrey Lepikhov <a.lepikhov@postgrespro.ru>)
List: pgsql-bugs
On Sat, Jan 18, 2020 at 9:41 PM Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote:
> While running the Join Order Benchmark [1] test on my laptop with 8GB
> RAM, I saw a strange situation where Linux (Ubuntu 18, PostgreSQL master
> branch) killed a backend (or a parallel worker in some cases) with signal 9.
> Dmesg showed an error:
> [176313.266426] Out of memory: Kill process 16011 (postgres) score 135
> or sacrifice child
> [176313.266432] Killed process 16011 (postgres) total-vm:1602320kB,
> anon-rss:1325908kB, file-rss:1136kB, shmem-rss:44612kB
>
> I spent time investigating and found the problem.
> The calculation of the hash table size is based on the GUC "work_mem". But if
> we have a huge relation, we divide it into batches, stored in shared
> tuplestores. At the first write into a batch, the tuplestore allocates a
> memory buffer:
>
> accessor->write_chunk = (SharedTuplestoreChunk *)
>     MemoryContextAllocZero(accessor->context, STS_CHUNK_PAGES * BLCKSZ);
>
> If we have many batches, we have many additional memory allocations. In
> my case work_mem=64MB, nbatches=65000, and we need about 2GB of
> additional memory.

Hi Andrey,

You're right. There are several things that we don't count when doing memory budgeting computations, and some of them grow in proportion to the number of batches. Here are three of them:

1. There's an array of ParallelHashJoinBatch objects in shared memory.
2. There's a buffer inside every SharedTuplestoreAccessor (write_chunk).
3. There's a buffer inside every BufFile.

Point 3 applies to non-parallel hash join too. The buffers are the bigger problems, simply because they're bigger, especially the one you highlighted. The double-buffering there could be fixed (SharedTuplestoreAccessor doesn't care about BufFile's buffering; it mostly wants its segmentation logic). I don't think that's the only problem with having large numbers of batches, though. It generates about the worst access pattern imaginable for the OS and storage.
If you set work_mem small enough, you can finish up writing randomly to many millions of files, but having to reopen them every time due to max_files_per_process, while the kernel keeps evicting the inodes from memory and flushing small unmergeable I/Os all over the device. At some point it must be better to do multiple passes (that's what some other systems do when partitioning hash joins in memory just to avoid CPU cache misses, but partitioning to disk files is another matter). Perhaps you'd partition the data into (say) 1024 batches + one 'overflow' batch for all the rest, and then partition the rest into the next 1024 batches + the rest, and so on (Tomas Vondra proposed something like this). Another idea is that you'd limit the number of batches to a sane number and switch to a kind of nested loop hash join, where you load fragments of the hash table at a time. The advantage of the looping approach is that it can handle some cases that the overflow idea can't, but on the other hand its complexity is worse: it's a kind of nested loop.

There are some links to threads about these topics here (and I've added your report as a link):

https://wiki.postgresql.org/wiki/Hash_Join

> I haven't specialized in the Parallel Hash Join code before, but
> looking at the code, a more accurate calculation of the size of the
> hash table would be based on a quadratic equation, something like this:
>
> M_h = (M + sqrt(M^2 - M_b*RelSize)) / 2
>
> Here M is the work_mem GUC; M_h is the estimate of the hash table size;
> RelSize is the size of the relation in bytes; M_b = 4*STS_CHUNK_PAGES*BLCKSZ.
>
> In the case of M^2 < M_b*RelSize we need to increase
> work_mem locally.
>
> As a sketch of the solution I prepared the patch (see attachment).
> If this is a significant problem, I'm ready to continue working on the
> solution.

We can discuss how to do the accounting maths, but the real question is: what do you actually want to do when you run out of memory?
At some point you can't create any more batches, because that takes more memory itself, so you have to switch to some new strategy, or give up on work_mem. If you have to give up on work_mem as soon as you hit this condition anyway, then you'll still be able to get an OOM SIGKILL. With the "overflow" idea, you can at least add new batches for "free" (in terms of bookkeeping memory), though you can't deal with indivisible batches (extreme skew). With the "loop" idea, you can stop adding new batches, and you have a new way to divide indivisible batches.