Bufferless buffered files
From | Thomas Munro |
---|---|
Subject | Bufferless buffered files |
Date | |
Msg-id | CA+hUKGKCnU9NjFfzO219V-YeyWr8mZe4JRrf=x_uv6qsePBcOw@mail.gmail.com |
List | pgsql-hackers |
"$SUBJECT" may sound like a Zen kōan, but buffile.c does a few useful things other than buffering (I/O time tracking, inter-process file exchange though file sets, segmentation, etc). A couple of places such as logtape.c, sharedtuplestore.c and gistbuildbuffers.c do block-oriented I/O in temporary files, but unintentionally also pay the price of buffile.c's buffering scheme for no good reason, namely: * extra memory for the buffer * extra memcpy() to/from that buffer * I/O chopped up into BLCKSZ-sized system calls Melanie and I were bouncing some ideas around about the hash join batch memory usage problem, and I thought it might be good to write down these thoughts about lower level buffer management stuff in a new thread. They don't address the hard algorithmic issues in that topic (namely: at some point, doubling the number of partition files will take more buffer space than can be saved by halving the size of the hash table, parallel or not), but we probably should remove a couple of the constants that make the parallel case worse. As a side benefit it would be nice to also fix some obvious silly efficiency problems for all users of BufFiles. First, here is a toy patch to try to skip the buffering layer in BufFile when possible, without changing any interfaces. Other ways would of course be possible, including a new separate module with a different name. The approach taken here is to defer allocation of BufFile::buffer, and use a fast path that just passes I/O straight through to the OS without touching the sides, until the first non-block-aligned I/O is seen, at which point we switch to the less efficient mode. Don't take this code too seriously, and I'm not sure what to think about short reads/writes yet; it's just a cheap demo to accompany this problem statement. We could go further (and this one is probably more important). SharedTuplestore works in 32KB chunks, which are the "parallel grain" for distributing data to workers that read the data back later (usually the workers try to read from different files to avoid stepping on each others' toes, but when there aren't enough files left they gang up and read from one file in parallel, and we need some way to scatter the data to different workers; these chunks are analogous to the heap table blocks handed out by Parallel Seq Scan). With the bufferless-BufFile optimisation, data is written out directly from sharedtuplestore.c's buffer to the OS in a 32KB pwrite(), which is a nice improvement, but that's 32KB of buffer space that non-parallel HJ doesn't have. We could fix that by making the chunk size dynamic, and turning it down as low as 8KB when the number of partitions shoots up to silly numbers by some heuristics. (I guess the buffer size and the chunk size don't have to be strictly the same, but it'd probably make the code simpler...) With those changes, PHJ would use strictly no more buffer memory than HJ, unless npartitions is fairly low when it should be OK to do so (and we could even consider making it larger sometimes for bigger I/Os). (Obviously it's just terrible in every way when we have very high numbers of partitions, and this just mitigates one symptom. For example, once we get to our max fds limit, we're re-opening and re-closing files at high speed, which is bonkers, and the kernel has an identical buffer space management problem on its side of the wall. In the past we've talked about something a bit more logtape.c-like (multiple streams in one file with our own extent management) which might help a bit. But... 
We could go further (and this one is probably more important). SharedTuplestore works in 32KB chunks, which are the "parallel grain" for distributing data to workers that read the data back later (usually the workers try to read from different files to avoid stepping on each others' toes, but when there aren't enough files left they gang up and read from one file in parallel, and we need some way to scatter the data to different workers; these chunks are analogous to the heap table blocks handed out by Parallel Seq Scan). With the bufferless-BufFile optimisation, data is written out directly from sharedtuplestore.c's buffer to the OS in a 32KB pwrite(), which is a nice improvement, but that's 32KB of buffer space that non-parallel HJ doesn't have. We could fix that by making the chunk size dynamic, turning it down as low as 8KB by some heuristic when the number of partitions shoots up to silly numbers. (I guess the buffer size and the chunk size don't have to be strictly the same, but it'd probably make the code simpler...) With those changes, PHJ would use strictly no more buffer memory than HJ, except when npartitions is fairly low and it should be OK to use more (and we could even consider making it larger sometimes for bigger I/Os).

(Obviously it's just terrible in every way when we have very high numbers of partitions, and this just mitigates one symptom. For example, once we get to our max fds limit, we're re-opening and re-closing files at high speed, which is bonkers, and the kernel has an identical buffer space management problem on its side of the wall. In the past we've talked about something a bit more logtape.c-like (multiple streams in one file with our own extent management), which might help a bit. But it feels like what we really need is a higher level algorithmic solution that lets us cap the number of partitions, as chronicled elsewhere.)

We could go further again. The above concerns the phase where we're writing data out to many partitions, which requires us to have in-memory state for all of them. But during the reading back phase, we read from just a single partition at a time, so we don't have the memory explosion problem. For dumb implementation reasons (essentially because it was styled on the older pre-parallel HJ code), SharedTuplestore reads one tuple at a time from buffile.c with BufFileRead(), so while reading it *needs* BufFile's internal buffering, which means 8KB pread() calls and no optimisation for you. We could change it to read whole 32KB chunks at a time (or a dynamic size, as mentioned already) via the faster bufferless path described above. The next benefit would be that, with whole chunks in *its* buffer, it could emit MinimalTuples using a pointer directly into that buffer, skipping another memcpy() (unless the tuple is too big and involves overflow chunks, a rare case). (Perhaps there could even be some clever way to read whole chunks directly into the hash table's memory chunks at once, instead of copying tuples in one by one; not examined.)

Another closely related topic that came up in other recent I/O work: how can we arrange for all of these buffers to be allocated more efficiently? In fact we know how many partitions there are at a higher level, and we'd ideally make one palloc_aligned(PG_IO_ALIGN_SIZE) allocation for all of them, instead of many individual smaller pallocs. Aside from wasted CPU cycles, individual pallocs have invisible space overheads at both the libc and palloc levels, and non-I/O-aligned buffers have invisible costs at the I/O system call level (the kernel must hold/pin/something an extra neighbouring VM page while copying in/out of user space, if not nicely aligned). However, palloc_aligned()'s strategy is to over-allocate by 4KB and then round the pointer up, so we'd ideally like to do that just once for a whole array, not once for each BufFile (you can see this problem in the attached patch, which does that for individual BufFile objects, but it'd make the problem Jehan-Guillaume wrote about slightly worse, which is why I didn't do that in commit faeedbcef "Introduce PG_IO_ALIGN_SIZE and align all I/O buffers."). This seems to suggest a different design where the client code can optionally provide its own buffer space, perhaps through something as simple as BufFileSetBuffer(): we'd make one big allocation and pass in pointers to its elements for the raw batch files (for HJ), and come up with something similar for SharedTuplestore's internal chunk buffer (for PHJ).
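To illustrate that last idea, here is a stand-alone sketch of what a caller-provided-buffer arrangement might look like at the hash join batch setup site. BufFileSetBuffer() doesn't exist today, so ToyBufFile and ToyBufFileSetBuffer() are invented stand-ins, and real code would presumably use palloc_aligned(..., PG_IO_ALIGN_SIZE, ...) in a suitable memory context rather than posix_memalign():

```c
#include <stdio.h>
#include <stdlib.h>

#define BLCKSZ			8192
#define IO_ALIGN_SIZE	4096	/* stand-in for PG_IO_ALIGN_SIZE */

/* Stand-in for a BufFile, showing only the field this sketch cares about. */
typedef struct ToyBufFile
{
	char	   *buffer;
} ToyBufFile;

/* Hypothetical "client supplies the buffer" call; no such API exists yet. */
static void
ToyBufFileSetBuffer(ToyBufFile *file, char *buffer)
{
	file->buffer = buffer;
}

int
main(void)
{
	int			nbatches = 64;
	ToyBufFile *batches = calloc((size_t) nbatches, sizeof(ToyBufFile));
	char	   *slab;

	if (batches == NULL)
		return 1;

	/*
	 * One I/O-aligned allocation for every batch file, instead of one
	 * separately aligned (over-)allocation per BufFile.
	 */
	if (posix_memalign((void **) &slab, IO_ALIGN_SIZE,
					   (size_t) nbatches * BLCKSZ) != 0)
	{
		perror("posix_memalign");
		return 1;
	}

	for (int i = 0; i < nbatches; i++)
		ToyBufFileSetBuffer(&batches[i], slab + (size_t) i * BLCKSZ);

	printf("assigned %d buffers carved from one %zu-byte slab\n",
		   nbatches, (size_t) nbatches * BLCKSZ);

	free(slab);
	free(batches);
	return 0;
}
```

The alignment padding is paid once for the whole slab rather than once per BufFile, which is the allocation-efficiency point above.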
Attachments