Discussion: Potential stack overflow in incremental base backup
Hi Robert,

I was rebasing my patch to convert RELSEG_SIZE into an initdb-time setting, and thus runtime variable, and I noticed this stack variable in src/backend/backup/basebackup_incremental.c:

GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
                    BlockNumber *relative_block_numbers,
                    unsigned *truncation_block_length)
{
    BlockNumber absolute_block_numbers[RELSEG_SIZE];

I'll have to move that sucker onto the heap (we banned C99 variable length arrays and we don't use nonstandard alloca()), but I think the coding in master is already a bit dodgy: that's 131072 * sizeof(BlockNumber) = 512kB with default configure options, but --with-segsize X could create a stack variable up to 16GB, corresponding to segment size 32TB (meaning no segmentation at all). That wouldn't work. Shouldn't we move it to the heap? See attached draft patch.

Even on the heap, 16GB is too much to assume we can allocate during a base backup. I don't claim that's a real-world problem for incremental backup right now in master, because I don't have any evidence that anyone ever really uses --with-segsize (do they?), but if we make it an initdb option it will be more popular and this will become a problem. Hmm.
Attachments
On 2024-Mar-06, Thomas Munro wrote:

> Even on the heap, 16GB is too much to assume we can allocate during a
> base backup. I don't claim that's a real-world problem for
> incremental backup right now in master, because I don't have any
> evidence that anyone ever really uses --with-segsize (do they?), but
> if we make it an initdb option it will be more popular and this will
> become a problem. Hmm.

Would it work to use a radix tree from the patchset at
https://postgr.es/m/CANWCAZb43ZNRK03bzftnVRAfHzNGzH26sjc0Ep-sj8+w20VzSg@mail.gmail.com
?

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"E pur si muove" (Galileo Galilei)
On Wed, Mar 6, 2024 at 12:44 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> I'll have to move that sucker onto the heap (we banned C99 variable
> length arrays and we don't use nonstandard alloca()), but I think the
> coding in master is already a bit dodgy: that's 131072 *
> sizeof(BlockNumber) = 512kB with default configure options, but
> --with-segsize X could create a stack variable up to 16GB,
> corresponding to segment size 32TB (meaning no segmentation at all).
> That wouldn't work. Shouldn't we move it to the heap? See attached
> draft patch.
>
> Even on the heap, 16GB is too much to assume we can allocate during a
> base backup. I don't claim that's a real-world problem for
> incremental backup right now in master, because I don't have any
> evidence that anyone ever really uses --with-segsize (do they?), but
> if we make it an initdb option it will be more popular and this will
> become a problem. Hmm.

I don't see a problem with moving it from the stack to the heap. I don't believe I had any particular reason for wanting it on the stack specifically.

I'm not sure what to do about really large segment sizes. As long as the allocation here is < ~100MB, it's probably fine, both because (1) 100MB isn't an enormously large allocation these days and (2) if you have a good reason to increase the segment size by a factor of 256 or more, you're probably running on a big machine, and then you should definitely have 100MB to burn. However, a 32TB segment size is another thing altogether. We could avoid transposing relative block numbers to absolute block numbers whenever start_blkno is 0, but that doesn't do a whole lot when the segment size is 8TB or 16TB rather than 32TB; plus, in such cases, the relative block number array is also going to be gigantic.
Maybe what we need to do is figure out how to dynamically size the arrays in such cases, so that you only make them as big as required for the file you actually have, rather than as big as the file could theoretically be. But I think that's really only necessary if we're actually going to get rid of the idea of segmented relations altogether, which I don't think is happening at least for v17, and maybe not ever.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Mar 6, 2024 at 6:29 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2024-Mar-06, Thomas Munro wrote:
> > Even on the heap, 16GB is too much to assume we can allocate during a
> > base backup. I don't claim that's a real-world problem for
> > incremental backup right now in master, because I don't have any
> > evidence that anyone ever really uses --with-segsize (do they?), but
> > if we make it an initdb option it will be more popular and this will
> > become a problem. Hmm.
>
> Would it work to use a radix tree from the patchset at
> https://postgr.es/m/CANWCAZb43ZNRK03bzftnVRAfHzNGzH26sjc0Ep-sj8+w20VzSg@mail.gmail.com
> ?

Probably not that much, because we actually send the array to the client very soon after we construct it:

    push_to_sink(sink, &checksum_ctx, &header_bytes_done,
                 incremental_blocks,
                 sizeof(BlockNumber) * num_incremental_blocks);

This is hard to do without materializing the array somewhere, so I don't think an alternate representation is the way to go in this instance.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Mar 8, 2024 at 6:53 AM Robert Haas <robertmhaas@gmail.com> wrote:
> But I think that's really only necessary if we're actually going to
> get rid of the idea of segmented relations altogether, which I don't
> think is happening at least for v17, and maybe not ever.

Yeah, I consider the feedback on ext4's size limitations to have completely killed the idea of getting rid of segments for the foreseeable future, at least in standard md.c (though who knows what people will do with pluggable smgr?). As for initdb --rel-segsize (CF #4305) for md.c, I abandoned plans to do that for 17 because I couldn't see what to do about this issue. Incremental backup effectively relies on smaller segments, by using them as problem-dividing granularity for checksumming and memory usage. That'll take some research...
On Fri, Mar 8, 2024 at 6:53 AM Robert Haas <robertmhaas@gmail.com> wrote:
> ... We could
> avoid transposing relative block numbers to absolute block numbers
> whenever start_blkno is 0, ...

Could we just write the blocks directly into the output array, and then transpose them directly in place if start_blkno > 0? See attached. I may be missing something, but the only downside I can think of is that the output array is still clobbered even if we decide to return BACK_UP_FILE_FULLY because of the 90% rule, but that just requires a warning in the comment at the top.
Attachments
On Wed, Apr 10, 2024 at 6:21 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Could we just write the blocks directly into the output array, and
> then transpose them directly in place if start_blkno > 0? See
> attached. I may be missing something, but the only downside I can
> think of is that the output array is still clobbered even if we decide
> to return BACK_UP_FILE_FULLY because of the 90% rule, but that just
> requires a warning in the comment at the top.

Yeah. This approach makes the name "relative_block_numbers" a bit confusing, but not running out of memory is nice, too.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Apr 11, 2024 at 12:11 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Apr 10, 2024 at 6:21 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Could we just write the blocks directly into the output array, and
> > then transpose them directly in place if start_blkno > 0? See
> > attached. I may be missing something, but the only downside I can
> > think of is that the output array is still clobbered even if we decide
> > to return BACK_UP_FILE_FULLY because of the 90% rule, but that just
> > requires a warning in the comment at the top.
>
> Yeah. This approach makes the name "relative_block_numbers" a bit
> confusing, but not running out of memory is nice, too.

Pushed. That fixes the stack problem.

Of course we still have this:

    /*
     * Since this array is relatively large, avoid putting it on the stack.
     * But we don't need it at all if this is not an incremental backup.
     */
    if (ib != NULL)
        relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);

To rescue my initdb --rel-segsize project[1] for v18, I will have a go at making that dynamic. It looks like we don't actually need to allocate that until we get to the GetFileBackupMethod() call, and at that point we have the file size. If I understand correctly, statbuf.st_size / BLCKSZ would be enough, so we could embiggen our block number buffer there if required, right? That way, you would only need shedloads of virtual memory if you used initdb --rel-segsize=shedloads and you actually have shedloads of data in a table/segment. For example, with initdb --rel-segsize=1TB and a table that contains 1TB+ of data, that'd allocate 512MB. It sounds borderline OK when put that way. It sounds not OK with --rel-segsize=infinite and 32TB of data -> palloc(16GB). I suppose one weasel-out would be to say we only support segments up to (say) 1TB, until eventually we figure out how to break this code's dependency on segments.
I guess we'll have to do that one day to support incremental backups of other smgr implementations that don't even have segments (segments are a private detail of md.c after all, not part of the smgr abstraction).

[1] https://commitfest.postgresql.org/48/4305/
On Wed, Apr 10, 2024 at 9:55 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Pushed. That fixes the stack problem.

Cool.

> Of course we still have this:
>
>     /*
>      * Since this array is relatively large, avoid putting it on the stack.
>      * But we don't need it at all if this is not an incremental backup.
>      */
>     if (ib != NULL)
>         relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
>
> To rescue my initdb --rel-segsize project[1] for v18, I will have a go
> at making that dynamic. It looks like we don't actually need to
> allocate that until we get to the GetFileBackupMethod() call, and at
> that point we have the file size. If I understand correctly,
> statbuf.st_size / BLCKSZ would be enough, so we could embiggen our
> block number buffer there if required, right?

Yes.

> That way, you would
> only need shedloads of virtual memory if you used initdb
> --rel-segsize=shedloads and you actually have shedloads of data in a
> table/segment. For example, with initdb --rel-segsize=1TB and a table
> that contains 1TB+ of data, that'd allocate 512MB. It sounds
> borderline OK when put that way. It sounds not OK with
> --rel-segsize=infinite and 32TB of data -> palloc(16GB). I suppose
> one weasel-out would be to say we only support segments up to (say)
> 1TB, until eventually we figure out how to break this code's
> dependency on segments. I guess we'll have to do that one day to
> support incremental backups of other smgr implementations that don't
> even have segments (segments are a private detail of md.c after all,
> not part of the smgr abstraction).

I think the thing to do would be to fetch and send the array of block numbers in chunks. Instead of calling BlockRefTableEntryGetBlocks() just once, you'd call it in a loop with a buffer that you know might not be big enough and then iterate until you've got everything, sending the partial block array to the client each time.
Then you repeat that loop a second time as you work your way through the giant file so that you always know what the next block you need to include in the incremental file is. The only problem here is: exactly how do you iterate? If BlockRefTableEntryGetBlocks() were guaranteed to return blocks in order, then you could just keep track of the highest block number you received from the previous call to that function and pass the starting block number for the next call as that value plus one. But as it is, that doesn't work. Whoops. Bad API design on my part, but fixable.

The other problem here is on the pg_combinebackup side, where we have a lot of the same issues. reconstruct.c wants to build a map of where it should get each block before it actually does any I/O. If the whole file is potentially terabytes in size, I guess that would need to use some other approach there, too. That's probably doable with a bunch of rejiggering, but not trivial.

I have to admit that I'm still not at all enthusiastic about 32TB segments. I think we're going to find that incremental backup is only the tip of the iceberg. I bet we'll find that there are all kinds of weird annoyances that pop up with 10TB+ files. It could be outright lack of support, like, oh, this archive format doesn't support files larger than X size. But it could also be subtler things like, hey, this directory copy routine updates the progress bar after each file, so when some of the files are gigantic, you can no longer count on having an accurate idea of how much of the directory has been copied. I suspect that a lot of the code that turns out to have problems will not be PostgreSQL code (although I expect us to find more problems in our code, too) so we'll end up telling people "hey, it's not OUR fault your stuff doesn't work, you just need to file a bug report with ${MAINTAINER_WHO_MAY_OR_MAY_NOT_CARE_ABOUT_30TB_FILES}".
And we won't catch any of the bugs in our own code or anyone else's in the regression tests, buildfarm, or cfbot, because none of that stuff is going to test with multi-terabyte files.

I do understand that a 1GB segment size is not that big in 2024, and that filesystems with a 2GB limit are thought to have died out a long time ago, and I'm not against using larger segments. I do think, though, that increasing the segment size by 32768x in one shot is likely to be overdoing it.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Apr 16, 2024 at 4:10 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Apr 10, 2024 at 9:55 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > To rescue my initdb --rel-segsize project[1] for v18, I will have a go
> > at making that dynamic. It looks like we don't actually need to
> > allocate that until we get to the GetFileBackupMethod() call, and at
> > that point we have the file size. If I understand correctly,
> > statbuf.st_size / BLCKSZ would be enough, so we could embiggen our
> > block number buffer there if required, right?
>
> Yes.

Here is a first attempt at that. Better factoring welcome. New observations made along the way: the current coding can exceed MaxAllocSize and error out, or overflow 32 bit size_t and allocate nonsense. Do you think it'd be better to raise an error explaining that, or silently fall back to full backup (tried that way in the attached), or that + log messages? Another option would be to use huge allocations, so we only have to deal with that sort of question for 32 bit systems (i.e. effectively hypothetical/non-relevant scenarios), but I don't love that idea.

> ...
> I do understand that a 1GB segment size is not that big in 2024, and
> that filesystems with a 2GB limit are thought to have died out a long
> time ago, and I'm not against using larger segments. I do think,
> though, that increasing the segment size by 32768x in one shot is
> likely to be overdoing it.

My intuition is that the primary interesting lines to cross are at 2GB and 4GB due to data type stuff. I defend against the most basic problem in my proposal: I don't let you exceed your off_t type, but that doesn't mean we don't have off_t/ssize_t/size_t/long snafus lurking in our code that could bite a 32 bit system with large files.
If you make it past those magic numbers and your tools are all happy, I think you should be home free until you hit file system limits, which are effectively unhittable on most systems except ext4's already bemoaned 16TB limit AFAIK. But all the same, I'm contemplating limiting the range to 1TB in the first version, not because of general fear of unknown unknowns, but just because it means we don't need to use "huge" allocations for this known place, maybe until we can address that.