Discussion: pg_upgrade --copy-file-range
Hello,

I was just in a pg_upgrade unconference session at PGCon where the lack of $SUBJECT came up. This system call gives the kernel the option to use fast block cloning on XFS, ZFS (as of very recently), etc., and works on Linux and FreeBSD. It's probably much the same as --clone mode on COW file systems, except that that is Linux-only. On overwrite file systems (ie not copy-on-write, like ext4), it may also be able to push copies down to storage hardware/network file systems.

There was something like this in the nearby large files patch set, but in that version it just magically did it when available in --copy mode. Now I think the user should have to opt in with --copy-file-range, and we should simply error out if it fails. It may not work in some cases -- for example, the man page says that older Linux systems can fail with EXDEV when you try to copy across file systems, while newer systems will do something less efficient but still sensible internally; also I saw a claim that some older versions had weird bugs. Better to just expose the raw functionality and let users say when they want it and read the error if it fails, I think.
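For context, a minimal standalone sketch (not PostgreSQL source; the function name here is made up) of the behavior described above - loop over copy_file_range() and simply error out on any failure, including EXDEV, instead of silently falling back to a plain read/write copy:

```c
#define _GNU_SOURCE				/* for copy_file_range() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Sketch only: copy a whole file with copy_file_range() and die on any
 * failure (e.g. EXDEV when crossing file systems on older Linux kernels)
 * rather than falling back to a block-by-block copy.
 */
static void
copy_file_range_or_die(const char *src, const char *dst)
{
	int			srcfd = open(src, O_RDONLY);
	int			dstfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0666);
	struct stat st;
	off_t		remaining;

	if (srcfd < 0 || dstfd < 0 || fstat(srcfd, &st) < 0)
	{
		perror("open/fstat");
		exit(1);
	}
	remaining = st.st_size;
	while (remaining > 0)
	{
		/* the "flags" argument is reserved for future use and must be 0 */
		ssize_t		n = copy_file_range(srcfd, NULL, dstfd, NULL,
										(size_t) remaining, 0);

		if (n <= 0)
		{
			perror("copy_file_range");
			exit(1);
		}
		remaining -= n;
	}
	close(srcfd);
	close(dstfd);
}
```

Whether the kernel clones blocks, offloads the copy, or falls back to an internal copy is invisible at this level, which is exactly the planning problem discussed later in the thread.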
Attachments
On 02.06.23 21:30, Thomas Munro wrote:
> I was just in a pg_upgrade unconference session at PGCon where the lack of $SUBJECT came up. This system call gives the kernel the option to use fast block cloning on XFS, ZFS (as of very recently), etc., and works on Linux and FreeBSD. It's probably much the same as --clone mode on COW file systems, except that that is Linux-only. On overwrite file systems (ie not copy-on-write, like ext4), it may also be able to push copies down to storage hardware/network file systems.
>
> There was something like this in the nearby large files patch set, but in that version it just magically did it when available in --copy mode. Now I think the user should have to opt in with --copy-file-range, and we should simply error out if it fails. It may not work in some cases -- for example, the man page says that older Linux systems can fail with EXDEV when you try to copy across file systems, while newer systems will do something less efficient but still sensible internally; also I saw a claim that some older versions had weird bugs. Better to just expose the raw functionality and let users say when they want it and read the error if it fails, I think.

When we added --clone, copy_file_range() was available, but the problem was that it was hard for the user to predict whether you'd get the fast clone behavior or the slow copy behavior. That's the kind of thing you want to know when planning and testing your upgrade. At the time, there were patches passed around in Linux kernel circles that would have been able to enforce cloning via the flags argument of copy_file_range(), but that didn't make it to the mainline.

So, yes, being able to specify exactly which copy mechanism to use makes sense, so that users can choose the tradeoffs.

About your patch:

I think you should have a "check" function called from check_new_cluster(). That check function can then also handle the "not supported" case, and you don't need to handle that in parseCommandLine(). I suggest following the clone example for these, since the issues there are very similar.
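The clone example Peter refers to works by actually attempting the operation on a small scratch file. A hypothetical check for copy_file_range() in the same spirit might look like the sketch below (the function name, file names, and structure are invented for illustration; this is not the committed PostgreSQL code):

```c
#define _GNU_SOURCE				/* for copy_file_range() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical sketch: probe whether copy_file_range() actually works
 * between two directories by copying a tiny scratch file, the same way
 * pg_upgrade's clone check probes ioctl(FICLONE).
 */
static bool
check_copy_file_range_works(const char *old_dir, const char *new_dir)
{
	char		src[1024];
	char		dst[1024];
	int			srcfd;
	int			dstfd;
	off_t		off_in = 0;
	bool		ok = false;

	snprintf(src, sizeof(src), "%s/PG_VERSION.cfrtest.src", old_dir);
	snprintf(dst, sizeof(dst), "%s/PG_VERSION.cfrtest.dst", new_dir);

	srcfd = open(src, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (srcfd < 0)
		return false;
	if (write(srcfd, "test", 4) != 4)
	{
		close(srcfd);
		unlink(src);
		return false;
	}

	dstfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (dstfd >= 0)
	{
		/* a single successful (possibly short) copy proves support */
		ok = copy_file_range(srcfd, &off_in, dstfd, NULL, 4, 0) > 0;
		close(dstfd);
		unlink(dst);
	}
	close(srcfd);
	unlink(src);
	return ok;
}
```

Running the probe at check time means an unsupported configuration fails up front with a clear message instead of partway through the transfer.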
On Mon, Jul 3, 2023 at 7:47 PM Peter Eisentraut <peter@eisentraut.org> wrote:
> When we added --clone, copy_file_range() was available, but the problem was that it was hard for the user to predict whether you'd get the fast clone behavior or the slow copy behavior. That's the kind of thing you want to know when planning and testing your upgrade. At the time, there were patches passed around in Linux kernel circles that would have been able to enforce cloning via the flags argument of copy_file_range(), but that didn't make it to the mainline.
>
> So, yes, being able to specify exactly which copy mechanism to use makes sense, so that users can choose the tradeoffs.

Thanks for looking. Yeah, it is quite inconvenient for planning purposes that it is hard for a user to know which internal strategy it uses, but that's the interface we have (and clearly "flags" is reserved for future use, so that might still evolve...).

> About your patch:
>
> I think you should have a "check" function called from check_new_cluster(). That check function can then also handle the "not supported" case, and you don't need to handle that in parseCommandLine(). I suggest following the clone example for these, since the issues there are very similar.

Done.
Attachments
On 08.10.23 07:15, Thomas Munro wrote:
>> About your patch:
>>
>> I think you should have a "check" function called from check_new_cluster(). That check function can then also handle the "not supported" case, and you don't need to handle that in parseCommandLine(). I suggest following the clone example for these, since the issues there are very similar.
>
> Done.

This version looks good to me.

Tiny nit: You copy-and-pasted "%s/PG_VERSION.clonetest"; maybe choose a different suffix.
On 13.11.23 08:15, Peter Eisentraut wrote:
> On 08.10.23 07:15, Thomas Munro wrote:
>>> About your patch:
>>>
>>> I think you should have a "check" function called from check_new_cluster(). That check function can then also handle the "not supported" case, and you don't need to handle that in parseCommandLine(). I suggest following the clone example for these, since the issues there are very similar.
>>
>> Done.
>
> This version looks good to me.
>
> Tiny nit: You copy-and-pasted "%s/PG_VERSION.clonetest"; maybe choose a different suffix.

Thomas, are you planning to proceed with this patch?
On Sat, Dec 23, 2023 at 9:40 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 13.11.23 08:15, Peter Eisentraut wrote:
>> On 08.10.23 07:15, Thomas Munro wrote:
>>>> About your patch:
>>>>
>>>> I think you should have a "check" function called from check_new_cluster(). That check function can then also handle the "not supported" case, and you don't need to handle that in parseCommandLine(). I suggest following the clone example for these, since the issues there are very similar.
>>>
>>> Done.
>>
>> This version looks good to me.
>>
>> Tiny nit: You copy-and-pasted "%s/PG_VERSION.clonetest"; maybe choose a different suffix.
>
> Thomas, are you planning to proceed with this patch?

Yes. Sorry for being slow... got stuck working on an imminent new version of streaming read. I will be defrosting my commit bit and committing this one and a few things shortly.

As it happens I was just thinking about this particular patch because I suddenly had a strong urge to teach pg_combinebackup to use copy_file_range. I wonder if you had the same idea...
On Sat, Dec 23, 2023 at 09:52:59AM +1300, Thomas Munro wrote:
> As it happens I was just thinking about this particular patch because I suddenly had a strong urge to teach pg_combinebackup to use copy_file_range. I wonder if you had the same idea...

Yeah, +1. That would make copy_file_blocks() more efficient where the code is copying 50 blocks in batches because it needs to reassign checksums to the blocks copied.
--
Michael
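As a rough illustration of the batched copy being discussed - read a batch of blocks, feed it to the checksum, write it out - here is a simplified sketch with a toy additive checksum standing in for the real manifest checksum machinery (the function name and constants are assumptions, not the pg_combinebackup source):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BLCKSZ		8192		/* PostgreSQL's default block size */
#define BATCH_BLOCKS 50			/* copy this many blocks per read/write */

/* toy stand-in for the real checksum context update */
static void
checksum_update(uint64_t *ctx, const unsigned char *data, size_t len)
{
	for (size_t i = 0; i < len; i++)
		*ctx += data[i];
}

/*
 * Sketch of a batched block copy: read up to 50 blocks at a time,
 * update the running checksum with the batch, write the batch out.
 * Returns the toy checksum of the copied data.
 */
static uint64_t
copy_file_blocks_sketch(const char *src, const char *dst)
{
	unsigned char *buf = malloc((size_t) BLCKSZ * BATCH_BLOCKS);
	FILE	   *in = fopen(src, "rb");
	FILE	   *out = fopen(dst, "wb");
	uint64_t	checksum = 0;
	size_t		n;

	if (buf == NULL || in == NULL || out == NULL)
	{
		perror("copy_file_blocks_sketch");
		exit(1);
	}
	while ((n = fread(buf, 1, (size_t) BLCKSZ * BATCH_BLOCKS, in)) > 0)
	{
		checksum_update(&checksum, buf, n);
		if (fwrite(buf, 1, n, out) != n)
		{
			perror("fwrite");
			exit(1);
		}
	}
	fclose(in);
	fclose(out);
	free(buf);
	return checksum;
}
```

The point Michael makes follows directly: because the checksum is computed from the bytes as they pass through the buffer, this path cannot be replaced wholesale by a kernel-side copy when a checksum is required.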
Attachments
Hi Thomas, Michael, Peter and -hackers,

On Sun, Dec 24, 2023 at 3:57 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Sat, Dec 23, 2023 at 09:52:59AM +1300, Thomas Munro wrote:
>> As it happens I was just thinking about this particular patch because I suddenly had a strong urge to teach pg_combinebackup to use copy_file_range. I wonder if you had the same idea...
>
> Yeah, +1. That would make copy_file_blocks() more efficient where the code is copying 50 blocks in batches because it needs to reassign checksums to the blocks copied.

I've tried to achieve what you were discussing. Actually this was my first thought when using pg_combinebackup with larger (realistic) backup sizes back in December. Attached is a set of very DIRTY (!) patches that provide CoW options (--clone/--copy-file-range) to pg_combinebackup (just like pg_upgrade, to keep it in sync), while also refactoring some related bits of code to avoid duplication.

With XFS (with reflink=1, which is the default) on Linux with kernel 5.10 and ~210GB backups, I'm getting:

    root@jw-test-1:/xfs# du -sm *
    210229  full
    250     incr.1

Today in master, the old classic read()/while() loop without CoW/reflink optimization:

    root@jw-test-1:/xfs# rm -rf outtest; sync; sync; sync; echo 3 | sudo tee /proc/sys/vm/drop_caches; time /usr/pgsql17/bin/pg_combinebackup --manifest-checksums=NONE -o outtest full incr.1
    3

    real    49m43.963s
    user    0m0.887s
    sys     2m52.697s

VS patch with "--clone":

    root@jw-test-1:/xfs# rm -rf outtest; sync; sync; sync; echo 3 | sudo tee /proc/sys/vm/drop_caches; time /usr/pgsql17/bin/pg_combinebackup --manifest-checksums=NONE --clone -o outtest full incr.1
    3

    real    0m39.812s
    user    0m0.325s
    sys     0m2.401s

So it is 49 mins down to 40 seconds(!), +/-10s (3 tries), if the FS supports CoW/reflinks (XFS, BTRFS, upcoming bcachefs?). It looks to me that this might mean that if one actually wants to use incremental backups (to get minimal RTO), it would be wise to only use CoW filesystems from the start so that RTO is as low as possible.

Random patch notes:
- main meat is in v3-0002*, I hope I did not screw up something seriously
- in the worst case: it is opt-in through a switch, so the user can always stick to the classic copy
- no docs so far
- pg_copyfile_offload_supported() should actually be fixed if it is a good path forward
- pgindent actually indents larger areas of code than I would like; any ideas, or is it ok?
- not tested on Win32/macOS/FreeBSD
- I've tested pg_upgrade manually and it seems to work and issue correct syscalls, however some tests are failing(?). I haven't investigated why yet due to lack of time. Any help is appreciated.

-J.
Attachments
On 05.01.24 13:40, Jakub Wartak wrote:
> Random patch notes:
> - main meat is in v3-0002*, I hope I did not screw up something seriously
> - in the worst case: it is opt-in through a switch, so the user can always stick to the classic copy
> - no docs so far
> - pg_copyfile_offload_supported() should actually be fixed if it is a good path forward
> - pgindent actually indents larger areas of code than I would like; any ideas, or is it ok?
> - not tested on Win32/macOS/FreeBSD
> - I've tested pg_upgrade manually and it seems to work and issue correct syscalls, however some tests are failing(?). I haven't investigated why yet due to lack of time.

Something is wrong with the pgindent in your patch set. Maybe you used a wrong version. You should try to fix that, because it is hard to process your patch with that amount of unrelated reformatting.

As far as I can tell, the original pg_upgrade patch has been ready to commit since October. Unless Thomas has any qualms that have not been made explicit in this thread, I suggest we move ahead with that.

And then Jakub could rebase his patch set on top of that. It looks like if the formatting issues are fixed, the remaining pg_combinebackup support isn't that big.
On Wed, Mar 6, 2024 at 2:43 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> As far as I can tell, the original pg_upgrade patch has been ready to commit since October. Unless Thomas has any qualms that have not been made explicit in this thread, I suggest we move ahead with that.

pg_upgrade --copy-file-range pushed. The only change I made was to remove the EINTR retry condition, which was subtly wrong and actually not needed here AFAICS. (Erm, maybe I did have an unexpressed qualm about some bug reports unfolding around that time about corruption linked to copy_file_range that might have spooked me, but those seem to have been addressed.)

> And then Jakub could rebase his patch set on top of that. It looks like if the formatting issues are fixed, the remaining pg_combinebackup support isn't that big.

+1

I'll also go and rebase CREATE DATABASE ... STRATEGY=file_clone [1].

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
Hi,

I took a quick look at the remaining part adding copy_file_range to pg_combinebackup. The patch no longer applies, so I had to rebase it. Most of the issues were trivial, but I had to fix a couple of missing prototypes - I added them to copy_file.h/c, mostly.

0001 is the minimal rebase + those fixes

0002 has a couple of review comments in copy_file, and it also undoes a lot of unnecessary formatting changes (already pointed out by Peter a couple of days ago).

A couple of review comments:

1) AFAIK opt_errinfo() returns a pointer to the local "buf" variable.

2) I wonder if we even need opt_errinfo(). I'm not sure it actually makes anything simpler.

3) I think it'd be nice to make CopyFileMethod more consistent with transferMode in pg_upgrade.h (I mean, it seems wise to make the naming more consistent; it's probably not worth unifying this somehow).

4) I wonder how we came up with copying the files by 50 blocks, but I now realize it's been like this before this patch. I only noticed because the patch adds a comment before the buffer_size calculation.

5) I dislike the renaming of copy_file_blocks to pg_copyfile. The new name is way more generic / less descriptive - it's clear the former copies the file block by block (well, in chunks); pg_copyfile is pretty vague.

6) This leaves behind copy_file_copyfile, which is now unused.

7) The patch reworks how combinebackup deals with alternative copy implementations - instead of setting strategy_implementation and calling that, the decisions now happen in pg_copyfile_offload with a lot of conditions / ifdef / defined ... I find it pretty hard to understand and reason about. I liked the strategy_implementation approach, as it forces us to keep each method in a separate function. Perhaps there's a reason why that doesn't work for copy_file_range? But in that case this needs much clearer comments.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
Hi Tomas,

> I took a quick look at the remaining part adding copy_file_range to pg_combinebackup. The patch no longer applies, so I had to rebase it. Most of the issues were trivial, but I had to fix a couple of missing prototypes - I added them to copy_file.h/c, mostly.
>
> 0001 is the minimal rebase + those fixes
>
> 0002 has a couple of review comments in copy_file, and it also undoes a lot of unnecessary formatting changes (already pointed out by Peter a couple of days ago).

Thank you very much for this! As discussed privately, I'm not in a position right now to pursue this further at this late stage (at least for v17, which would require an aggressive schedule). My plan was more for v18, after Peter's email, due to other obligations. But if you have cycles and want to continue, please do so without hesitation - I'll try to chime in along the way to test and review for sure.

> A couple of review comments:
>
> 1) AFAIK opt_errinfo() returns a pointer to the local "buf" variable.
>
> 2) I wonder if we even need opt_errinfo(). I'm not sure it actually makes anything simpler.

Yes, as it stands it's broken (somehow I missed the gcc warning); it should be pg_malloc(). I hardly remember, but I wanted to avoid code duplication. No strong opinion; maybe that's a different style, I'll adapt as necessary.

> 3) I think it'd be nice to make CopyFileMethod more consistent with transferMode in pg_upgrade.h (I mean, it seems wise to make the naming more consistent; it's probably not worth unifying this somehow).
>
> 4) I wonder how we came up with copying the files by 50 blocks, but I now realize it's been like this before this patch. I only noticed because the patch adds a comment before the buffer_size calculation.

It looks like it was like that before pg_upgrade was even moved into core. 400kB is indeed a bit of a strange value, so we can leave it as it is, or make COPY_BUF_SIZ 128kB - see [1] (I've double-checked that cp(1) still uses 128kB today) - or maybe just stick to something like 256 or 512kB.

> 5) I dislike the renaming of copy_file_blocks to pg_copyfile. The new name is way more generic / less descriptive - it's clear the former copies the file block by block (well, in chunks); pg_copyfile is pretty vague.
>
> 6) This leaves behind copy_file_copyfile, which is now unused.
>
> 7) The patch reworks how combinebackup deals with alternative copy implementations - instead of setting strategy_implementation and calling that, the decisions now happen in pg_copyfile_offload with a lot of conditions / ifdef / defined ... I find it pretty hard to understand and reason about. I liked the strategy_implementation approach, as it forces us to keep each method in a separate function.

Well, some context (maybe it was my mistake to continue in this thread rather than starting a new one): my plan was 3-in-1: in the original proposal (from Jan) to provide CoW as a generic facility for others to use - in src/common/file_utils.c as per v3-0002-Confine-various-OS-copy-on-write-and-other-copy-a.patch - to unify & confine CoW methods and their quirkiness between pg_combinebackup and pg_upgrade and other potential CoW uses too. That was before Thomas M. pushed CoW just for pg_upgrade as d93627bcbe5001750e7611f0e637200e2d81dcff. I had this idea back then to have pg_copyfile() [normal block copy] and pg_copyfile_offload_supported(), pg_copyfile_offload(PG_COPYFILE_IOCTL_FICLONE, PG_COPYFILE_COPY_FILE_RANGE, PG_COPYFILE_who_has_idea_what_they_come_up_with_in_future). In your version of the patch it's local to pg_combinebackup, so it might make no sense after all.

If you look at pg_upgrade and pg_combinebackup, they both have code duplication with lots of those ifs/IFs (assuming the user wants to have it as a drop-in [--clone/--copy/--copyfile] and the platform may or may not have it). I even considered --cow=ficlone|copy_file_range to sync both tools from the CLI arguments point of view, but that would break backwards compatibility, so I did not do that.

Also, there's a problem with pg_combinebackup's strategy_implementation in that it actually cannot on its own decide (I think) which CoW method to use or not. There were some longer discussions that settled on one thing (for pg_upgrade): it's the user who is in control of HOW the copy gets done (due to potential issues in OS CoW() implementations, e.g. if NFS would be involved on one side). See pg_upgrade's --clone/--copy/--copy-file-range/--sync-method options. I wanted to stick to that, so pg_combinebackup also needs to give the same options to the user.

That was for the historical context. Now, you wrote "it's probably not worth unifying this somehow" a few sentences earlier, so my take is the following: we can just concentrate on getting copy_file_range() and ioctl_ficlone into pg_combinebackup at the price of duplicating some complexity for now (in short, to start with a clean slate; it doesn't necessarily need to be my patch as the base if we think it's worthwhile for v17 - or stick to your reworked patch of mine). Later (v18?) some bigger refactor could unify and move the copy methods to some more central place (so then we would be in sync, as there would be no doubling like you mentioned, e.g. pg_upgrade's enum transferMode <-> the patch's enum CopyFileMethod).

So for now I'm +1 on renaming all the things as you want -- indeed pg_copy* might not be a good fit in a localized version.

-J.

[1] - https://eklitzke.org/efficient-file-copying-on-linux
Here's a patch reworked along the lines from a couple of days ago.

The primary goals were to add clone/copy_file_range while minimizing unnecessary disruption, and overall cleanup of the patch. I'm not saying it's committable, but I think the patch is much easier to understand.

The main change is that this abandons the idea of handling all possible cases in a single function that looks like a maze of ifdefs, and instead separates each case into its own function, and the decision happens much earlier. This is pretty much exactly what pg_upgrade does, BTW.

There's maybe an argument that these functions could be unified and moved to a library in src/common - I can imagine doing that, but I don't think it's required. The functions are pretty trivial wrappers, and it's not like we expect many more callers. And there's probably stuff we'd need to keep out of that library (e.g. the decision which copy/clone methods are available / should be used, or error reporting). So it doesn't seem worth it, at least for now.

There's one question, though. As it stands, there's a bit of asymmetry between handling CopyFile() on WIN32 and the clone/copy_file_range on other platforms. On WIN32, we simply switch to CopyFile automatically if we don't need to calculate a checksum. But with the other methods, we error out if the user requests those and we need to calculate the checksum.

The asymmetry comes from the fact there's no option to request CopyFile on WIN32, and we feel confident it's always the right thing to do (safe, faster). We don't seem to know that for the other methods, so the user has to explicitly request those. And if the user requests --clone, for example, it'd be wrong to silently fall back to plain copy. Still, I wonder if this might cause some undesirable issues during restores. But I guess that's why we have --dry-run.

This asymmetry also shows a bit in the code - the CopyFile is coded and called a bit differently from the other methods. FWIW I abandoned the approach with "strategy" and just use a switch on a CopyMode enum, just like pg_upgrade does.

There's a couple more smaller changes:

- Addition of docs for --clone/--copy-file-range (shameless copy from pg_upgrade docs).

- Removal of opt_errinfo - not only was it buggy, I think the code is actually cleaner without it.

- Removal of the EINTR retry condition from copy_file_range handling (this is what Thomas ended up with for pg_upgrade while committing that part).

Put together, this cuts the patch from ~40kB to ~15kB (most of this is due to the cleanup of unnecessary whitespace changes, though).

I think to make this committable, this requires some review and testing, ideally on a range of platforms.

One open question is how to allow testing this. For pg_upgrade we now have PG_TEST_PG_UPGRADE_MODE, which can be set to e.g. "--clone". I wonder if we should add PG_TEST_PG_COMBINEBACKUP_MODE ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
On Fri, Mar 22, 2024 at 10:40 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> There's one question, though. As it stands, there's a bit of asymmetry between handling CopyFile() on WIN32 and the clone/copy_file_range on other platforms. On WIN32, we simply switch to CopyFile automatically if we don't need to calculate a checksum. But with the other methods, we error out if the user requests those and we need to calculate the checksum.

That seems completely broken. copy_file() needs to have the ability to calculate a checksum if one is required; when one isn't required, it can do whatever it likes. So we should always fall back to the block-by-block method if we need a checksum. Whatever option the user specified should only be applied when we don't need a checksum.

Consider, for example:

pg_basebackup -D sunday -c fast --manifest-checksums=CRC32C
pg_basebackup -D monday -c fast --manifest-checksums=SHA224 --incremental sunday/backup_manifest
pg_combinebackup sunday monday -o tuesday --manifest-checksums=CRC32C --clone

Any files that are copied in their entirety from Sunday's backup can be cloned, if we have support for cloning. But any files copied from Monday's backup will need to be re-checksummed, since the checksum algorithms don't match. With what you're describing, it sounds like pg_combinebackup would just fail in this case; I don't think that's the behavior that the user is going to want.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 3/22/24 17:42, Robert Haas wrote:
> That seems completely broken. copy_file() needs to have the ability to calculate a checksum if one is required; when one isn't required, it can do whatever it likes. So we should always fall back to the block-by-block method if we need a checksum. Whatever option the user specified should only be applied when we don't need a checksum.
>
> Any files that are copied in their entirety from Sunday's backup can be cloned, if we have support for cloning. But any files copied from Monday's backup will need to be re-checksummed, since the checksum algorithms don't match. With what you're describing, it sounds like pg_combinebackup would just fail in this case; I don't think that's the behavior that the user is going to want.

Right, this will happen:

pg_combinebackup: error: unable to use accelerated copy when manifest checksums are to be calculated. Use --no-manifest

Are you saying we should just silently override the copy method and do the copy block by block? I'm not strongly opposed to that, but it feels wrong to just ignore that the user explicitly requested cloning, and I'm not sure why this should be different from any other case where the user requests an incompatible combination of options and/or options that are not supported in the current configuration. Why not just tell the user to use the correct parameters, i.e. either remove --clone or add --no-manifest?

FWIW I now realize it actually fails a bit earlier than I thought - when parsing the options, not in copy_file. But then some checks (whether a given copy method is supported) happen in the copy functions. I wonder if it'd be better/possible to do all of that in one place, not sure.

Also, the message only suggests using --no-manifest. It probably should suggest removing --clone too.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 22, 2024 at 1:22 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> Right, this will happen:
>
> pg_combinebackup: error: unable to use accelerated copy when manifest checksums are to be calculated. Use --no-manifest
>
> Are you saying we should just silently override the copy method and do the copy block by block?

Yes.

> I'm not strongly opposed to that, but it feels wrong to just ignore that the user explicitly requested cloning, and I'm not sure why this should be different from any other case where the user requests an incompatible combination of options and/or options that are not supported in the current configuration.

I don't feel like copying block-by-block when that's needed to compute a checksum is ignoring what the user requested. I mean, if we'd had to perform reconstruction rather than copying an entire file, we would have done that regardless of whether --clone had been there, and I don't see why the need-to-compute-a-checksum case is any different. I think we should view a flag like --clone as specifying how to copy a file when we don't need to do anything but copy it. I don't think it should dictate that we're not allowed to perform other processing when that other processing is required.

From my point of view, this is not a case of incompatible options having been specified. If you run pg_basebackup with both --format=p and --format=t, those are incompatible options; the backup can be done one way or the other, but not both at once. But here there's no such conflict. Block-by-block copying and fast-copying can happen as part of the same operation, as in the example that I showed, where some files need the block-by-block copying and some can be fast-copied. The user is entitled to specify which fast-copying method they would like to have used for the files where fast-copying is possible, without getting a failure just because it isn't possible for every single file.

Or to say it the other way around, if there's 1 file that needs to be copied block by block and 99 files that can be fast-copied, you want to force the user to use the block-by-block method for all 100 files. I want to force the use of the block-by-block method for the 1 file where that's the only valid method, and let them choose what they want to do for the other 99.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hmm, this discussion seems to assume that we only use copy_file_range() to copy/clone whole segment files, right? That's great and may even get most of the available benefit given typical databases with many segments of old data that never changes, but... I think copy_file_range() allows us to go further than the other whole-file clone techniques: we can stitch together parts of an old backup segment file and an incremental backup to create a new file. If you're interested in minimising disk use while also removing dependencies on the preceding chain of backups, then it might make sense to do that even if you *also* have to read the data to compute the checksums, I think? That's why I mentioned it: if copy_file_range() (ie sub-file-level block sharing) is a solution in search of a problem, has the world ever seen a better problem than pg_combinebackup?
On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Hmm, this discussion seems to assume that we only use copy_file_range() to copy/clone whole segment files, right? That's great and may even get most of the available benefit given typical databases with many segments of old data that never changes, but... I think copy_file_range() allows us to go further than the other whole-file clone techniques: we can stitch together parts of an old backup segment file and an incremental backup to create a new file. If you're interested in minimising disk use while also removing dependencies on the preceding chain of backups, then it might make sense to do that even if you *also* have to read the data to compute the checksums, I think? That's why I mentioned it: if copy_file_range() (ie sub-file-level block sharing) is a solution in search of a problem, has the world ever seen a better problem than pg_combinebackup?

That makes sense; it's just a different part of the code than I thought we were talking about.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 3/22/24 19:40, Robert Haas wrote: > On Fri, Mar 22, 2024 at 1:22 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> Right, this will happen: >> >> pg_combinebackup: error: unable to use accelerated copy when manifest >> checksums are to be calculated. Use --no-manifest >> >> Are you saying we should just silently override the copy method and do >> the copy block by block? > > Yes. > >> I'm not strongly opposed to that, but it feels >> wrong to just ignore that the user explicitly requested cloning, and I'm >> not sure why should this be different from any other case when the user >> requests incompatible combination of options and/or options that are not >> supported on the current configuration. > > I don't feel like copying block-by-block when that's needed to compute > a checksum is ignoring what the user requested. I mean, if we'd had to > perform reconstruction rather than copying an entire file, we would > have done that regardless of whether --clone had been there, and I > don't see why the need-to-compute-a-checksum case is any different. I > think we should view a flag like --clone as specifying how to copy a > file when we don't need to do anything but copy it. I don't think it > should dictate that we're not allowed to perform other processing when > that other processing is required. > > From my point of view, this is not a case of incompatible options > having been specified. If you specify run pg_basebackup with both > --format=p and --format=t, those are incompatible options; the backup > can be done one way or the other, but not both at once. But here > there's no such conflict. Block-by-block copying and fast-copying can > happen as part of the same operation, as in the example that I showed, > where some files need the block-by-block copying and some can be > fast-copied. 
The user is entitled to specify which fast-copying method > they would like to have used for the files where fast-copying is > possible without getting a failure just because it isn't possible for > every single file. > > Or to say it the other way around, if there's 1 file that needs to be > copied block by block and 99 files that can be fast-copied, you want > to force the user to the block-by-block method for all 100 files. I > want to force the use of the block-by-block method for the 1 file > where that's the only valid method, and let them choose what they want > to do for the other 99. > OK, that makes sense. Here's a patch that should work like this - in copy_file we check if we need to calculate checksums, and either use the requested copy method, or fall back to the block-by-block copy. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
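The fallback described above — use the requested fast-copy method only when no checksum has to be computed — can be sketched as a tiny dispatch helper. This is a minimal illustration with hypothetical names; the actual pg_combinebackup code structures this differently inside copy_file:

```c
#include <stdbool.h>

/* Hypothetical names -- the real pg_combinebackup enum differs in detail. */
typedef enum
{
	COPY_MODE_COPY,				/* plain block-by-block read/write */
	COPY_MODE_CLONE,			/* ioctl(FICLONE) */
	COPY_MODE_COPY_FILE_RANGE	/* copy_file_range() */
} CopyMethod;

/*
 * Decide how a file will actually be copied: any fast-copy method is
 * silently downgraded to a block-by-block copy when we must read the
 * data anyway to compute a manifest checksum.
 */
static CopyMethod
effective_copy_method(CopyMethod requested, bool need_checksum)
{
	if (need_checksum)
		return COPY_MODE_COPY;
	return requested;
}
```

With this shape, --clone or --copy-file-range still applies to the (typically many) files that can be fast-copied, while the few files needing checksums quietly take the slow path.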
Attachments
On 3/23/24 13:38, Robert Haas wrote: > On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> Hmm, this discussion seems to assume that we only use >> copy_file_range() to copy/clone whole segment files, right? That's >> great and may even get most of the available benefit given typical >> databases with many segments of old data that never changes, but... I >> think copy_write_range() allows us to go further than the other >> whole-file clone techniques: we can stitch together parts of an old >> backup segment file and an incremental backup to create a new file. >> If you're interested in minimising disk use while also removing >> dependencies on the preceding chain of backups, then it might make >> sense to do that even if you *also* have to read the data to compute >> the checksums, I think? That's why I mentioned it: if >> copy_file_range() (ie sub-file-level block sharing) is a solution in >> search of a problem, has the world ever seen a better problem than >> pg_combinebackup? > > That makes sense; it's just a different part of the code than I > thought we were talking about. > Yeah, that's in write_reconstructed_file() and the patch does not touch that at all. I agree it would be nice to use copy_file_range() in this part too, and it doesn't seem it'd be that hard to do, I think. It seems we'd just need a "fork" that either calls pread/pwrite or copy_file_range, depending on checksums and what was requested. BTW is there a reason why the code calls "write" and not "pg_pwrite"? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 3/23/24 14:47, Tomas Vondra wrote: > On 3/23/24 13:38, Robert Haas wrote: >> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas.munro@gmail.com> wrote: >>> Hmm, this discussion seems to assume that we only use >>> copy_file_range() to copy/clone whole segment files, right? That's >>> great and may even get most of the available benefit given typical >>> databases with many segments of old data that never changes, but... I >>> think copy_write_range() allows us to go further than the other >>> whole-file clone techniques: we can stitch together parts of an old >>> backup segment file and an incremental backup to create a new file. >>> If you're interested in minimising disk use while also removing >>> dependencies on the preceding chain of backups, then it might make >>> sense to do that even if you *also* have to read the data to compute >>> the checksums, I think? That's why I mentioned it: if >>> copy_file_range() (ie sub-file-level block sharing) is a solution in >>> search of a problem, has the world ever seen a better problem than >>> pg_combinebackup? >> >> That makes sense; it's just a different part of the code than I >> thought we were talking about. >> > > Yeah, that's in write_reconstructed_file() and the patch does not touch > that at all. I agree it would be nice to use copy_file_range() in this > part too, and it doesn't seem it'd be that hard to do, I think. > > It seems we'd just need a "fork" that either calls pread/pwrite or > copy_file_range, depending on checksums and what was requested. > Here's a patch to use copy_file_range in write_reconstructed_file too, when requested/possible. One thing that I'm not sure about is whether to do pg_fatal() if --copy-file-range but the platform does not support it. This is more like what pg_upgrade does, but maybe we should just ignore what the user requested and fallback to the regular copy (a bit like when having to do a checksum for some files). Or maybe the check should just happen earlier ... 
I've been thinking about what Thomas wrote - that maybe it'd be good to do copy_file_range() even when calculating the checksum, and I think he may be right. But the current patch does not do that, and while it doesn't seem very difficult to do (at least when reconstructing the file from incremental backups) I don't have a very good intuition whether it'd be a win or not in typical cases.

I have a naive question about the checksumming - if we used a merkle-tree-like scheme, i.e. computed the file checksum from per-block hashes rather than from the raw data, wouldn't that allow calculating the checksum even without having to read the unchanged blocks, making copy_file_range more efficient? Sure, it's more complex, but a well-known scheme. (OK, now I realize it'd mean we can't use tools like sha224sum to hash the files and compare to the manifest. I guess that's why we don't do this ...)

regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
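As a toy illustration of that idea (this is not PostgreSQL's manifest format, and a stand-in FNV-1a hash replaces a real digest like SHA-224): if the manifest recorded one hash per block and defined the file checksum over those block hashes, the hash of an unchanged block could be reused from the previous manifest without reading its data:

```c
#include <stdint.h>
#include <stddef.h>

#define FNV_INIT 14695981039346656037ULL

/* Toy 64-bit FNV-1a; stands in for a real cryptographic digest. */
static uint64_t
fnv1a(const unsigned char *buf, size_t len, uint64_t h)
{
	for (size_t i = 0; i < len; i++)
	{
		h ^= buf[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/* Hash one data block (would be read from disk, or skipped if unchanged). */
static uint64_t
block_hash(const unsigned char *block, size_t len)
{
	return fnv1a(block, len, FNV_INIT);
}

/*
 * File-level checksum computed over the per-block hashes rather than the
 * raw bytes.  For blocks the incremental backup did not touch, the caller
 * can supply the hash recorded in the previous manifest instead of
 * re-reading the block -- which is what would let copy_file_range() clone
 * those blocks without reading them.
 */
static uint64_t
file_hash(const uint64_t *block_hashes, size_t nblocks)
{
	return fnv1a((const unsigned char *) block_hashes,
				 nblocks * sizeof(uint64_t), FNV_INIT);
}
```

The downside is exactly the one noted above: the resulting file checksum no longer matches what a plain `sha224sum` over the file contents would produce.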
Attachments
On Sat, Mar 23, 2024 at 9:37 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > OK, that makes sense. Here's a patch that should work like this - in > copy_file we check if we need to calculate checksums, and either use the > requested copy method, or fall back to the block-by-block copy. + Use efficient file cloning (also known as <quote>reflinks</quote> on + some systems) instead of copying files to the new cluster. This can new cluster -> output directory I think your version kind of messes up the debug logging. In my version, every call to copy_file() would emit either "would copy \"%s\" to \"%s\" using strategy %s" and "copying \"%s\" to \"%s\" using strategy %s". In your version, the dry_run mode emits a string similar to the former, but creates separate translatable strings for each copy method instead of using the same one with a different value of %s. In non-dry-run mode, I think your version loses the debug logging altogether. -- Robert Haas EDB: http://www.enterprisedb.com
On Sat, Mar 23, 2024 at 9:47 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > BTW is there a reason why the code calls "write" and not "pg_pwrite"? I think it's mostly because I learned to code a really long time ago. -- Robert Haas EDB: http://www.enterprisedb.com
On Sat, Mar 23, 2024 at 6:57 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > On 3/23/24 14:47, Tomas Vondra wrote: > > On 3/23/24 13:38, Robert Haas wrote: > >> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas.munro@gmail.com> wrote: [..] > > Yeah, that's in write_reconstructed_file() and the patch does not touch > > that at all. I agree it would be nice to use copy_file_range() in this > > part too, and it doesn't seem it'd be that hard to do, I think. > > > > It seems we'd just need a "fork" that either calls pread/pwrite or > > copy_file_range, depending on checksums and what was requested. > > > > Here's a patch to use copy_file_range in write_reconstructed_file too, > when requested/possible. One thing that I'm not sure about is whether to > do pg_fatal() if --copy-file-range but the platform does not support it. [..]

Hi Tomas, so I gave the patches below a go today:
- v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch
- v20240323-2-0002-write_reconstructed_file.patch

My assessment:
v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch - looks more or less good to go
v20240323-2-0002-write_reconstructed_file.patch - needs work, and without it the good effects of clone/copy_file_range() are unlikely

Given Debian 12, a ~100GB DB (pgbench -i -s 7000, plus some additional tables with GiST and GIN indexes, just to see more WAL record types) and backup sizes in MB like this:

106831 full
2823 incr.1 # captured after some time with pgbench -R 100
165 incr.2 # captured after some time with pgbench -R 100

Test cmd: rm -rf outtest; sync; sync ; sync; echo 3 | sudo tee /proc/sys/vm/drop_caches ; time /usr/pgsql17/bin/pg_combinebackup -o outtest full incr.1 incr.2

Test results of the various copies on a small I/O-constrained XFS device:
normal copy: 31m47.407s
--clone copy: error: file cloning not supported on this platform (it's due to the #ifdef of having COPY_FILE_RANGE available)
--copy-file-range: aborted, as it was taking too long; I was expecting it to accelerate, but it did not... obviously this is the transparent failover in case of calculating checksums...
--manifest-checksums=NONE --copy-file-range: BUG, it kept appending to just one file e.g. outtest/base/5/16427.29 with 200GB+ ?? and ended up with ENOSPC [more on this later]
--manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 27m23.887s
--manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and loop-fix: 5m1.986s, but it creates corruption as it stands

Issues:

1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327 complains about win32/mingw:

[15:47:27.184] In file included from copy_file.c:22:
[15:47:27.184] copy_file.c: In function ‘copy_file’:
[15:47:27.184] ../../../src/include/common/logging.h:134:6: error: this statement may fall through [-Werror=implicit-fallthrough=]
[15:47:27.184] 134 | if (unlikely(__pg_log_level <= PG_LOG_DEBUG)) \
[15:47:27.184] | ^
[15:47:27.184] copy_file.c:96:5: note: in expansion of macro ‘pg_log_debug’
[15:47:27.184] 96 | pg_log_debug("would copy \"%s\" to \"%s\" (copy_file_range)",
[15:47:27.184] | ^~~~~~~~~~~~
[15:47:27.184] copy_file.c:99:4: note: here
[15:47:27.184] 99 | case COPY_MODE_COPYFILE:
[15:47:27.184] | ^~~~
[15:47:27.184] cc1: all warnings being treated as errors

2. I do not know what's the consensus between --clone and --copy-file-range, but if we have #ifdef FICLONE clone_works() #elif HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also apply the same logic to the --help so that --clone is not visible there (for consistency?). Also the "error: file cloning not supported on this platform" is technically incorrect, Linux does support ioctl(FICLONE) and copy_file_range(), but we are just not choosing one over another (so technically it is "available"). Nitpicking, I know.

3.
[v20240323-2-0002-write_reconstructed_file.patch]: The mentioned ENOSPACE spiral-of-death-bug symptoms are like that: strace: copy_file_range(8, [697671680], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697679872], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697688064], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697696256], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697704448], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697712640], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697720832], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697729024], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697737216], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697745408], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697753600], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697761792], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697769984], 9, NULL, 8192, 0) = 8192 Notice that dest_off_t (poutoff) is NULL. (gdb) where #0 0x00007f2cd56f6733 in copy_file_range (infd=8, pinoff=pinoff@entry=0x7f2cd53f54e8, outfd=outfd@entry=9, poutoff=poutoff@entry=0x0, length=length@entry=8192, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/copy_file_range.c:28 #1 0x00005555ecd077f4 in write_reconstructed_file (copy_mode=COPY_MODE_COPY_FILE_RANGE, dry_run=false, debug=true, checksum_ctx=0x7ffc4cdb7700, offsetmap=<optimized out>, sourcemap=0x7f2cd54f6010, block_length=<optimized out>, output_filename=0x7ffc4cdba910 "outtest/base/5/16427.29", input_filename=0x7ffc4cdba510 "incr.2/base/5/INCREMENTAL.16427.29") at reconstruct.c:648 #2 reconstruct_from_incremental_file (input_filename=input_filename@entry=0x7ffc4cdba510 "incr.2/base/5/INCREMENTAL.16427.29", output_filename=output_filename@entry=0x7ffc4cdba910 "outtest/base/5/16427.29", relative_path=relative_path@entry=0x7ffc4cdbc670 "base/5", bare_file_name=bare_file_name@entry=0x5555ee2056ef "16427.29", n_prior_backups=n_prior_backups@entry=2, prior_backup_dirs=prior_backup_dirs@entry=0x7ffc4cdbf248, manifests=0x5555ee137a10, 
manifest_path=0x7ffc4cdbad10 "base/5/16427.29", checksum_type=CHECKSUM_TYPE_NONE, checksum_length=0x7ffc4cdb9864, checksum_payload=0x7ffc4cdb9868, debug=true, dry_run=false, copy_method=COPY_MODE_COPY_FILE_RANGE) at reconstruct.c:327

.. it's a spiral of death till ENOSPC. Reverting the v20240323-2-0002-write_reconstructed_file.patch helps. The problem lies in that do { ... } while (wb > 0) infinite loop (?) along with the NULL for the destination off_t. This seems to solve that thingy(?):

-			do
-			{
-				wb = copy_file_range(s->fd, &offsetmap[i], wfd, NULL, BLCKSZ, 0);
+			//do
+			//{
+				wb = copy_file_range(s->fd, &offsetmap[i], wfd, &offsetmap[i], BLCKSZ, 0);
				if (wb < 0)
					pg_fatal("error while copying file range from \"%s\" to \"%s\": %m",
							 input_filename, output_filename);
-			} while (wb > 0);
+			//} while (wb > 0);
 #else

...so that way I've got it down to 5mins.

3. .. but on startup I've got this after trying a psql login: invalid page in block 0 of relation base/5/1259. I've again reverted the v20240323-2-0002 to see if that helped for the next round of pg_combinebackup --manifest-checksums=NONE --copy-file-range, and after 32mins of waiting it did help indeed: I was able to log in and the select counts worked and matched the data properly. I've reapplied the v20240323-2-0002 (with my fix to prevent that endless loop) and the issue was again(!) there. Probably it's related to the destination offset. I couldn't find more time to look at it today and the setup was a big 100GB on a slow device, so just letting you know as fast as possible.

4. More efficiency is on the table (optional; just for completeness; I don't think we have time for that?): even if v20240323-2-0002 would work, the problem is that it would be sending a syscall for every 8kB.
We seem to be performing lots of per-8KB syscalls which hinder performance (both in copy_file_range and in normal copy): pread64(8, ""..., 8192, 369115136) = 8192 // 369115136 + 8192 = 369123328 (matches next pread offset) write(9, ""..., 8192) = 8192 pread64(8, ""..., 8192, 369123328) = 8192 // 369123328 + 8192 = 369131520 write(9, ""..., 8192) = 8192 pread64(8, ""..., 8192, 369131520) = 8192 // and so on write(9, ""..., 8192) = 8192 Apparently there's no merging of adjacent IO/s, so pg_combinebackup wastes lots of time on issuing instead small syscalls but it could let's say do single pread/write (or even copy_file_range()). I think it was not evident in my earlier testing (200GB; 39min vs ~40s) as I had much smaller modifications in my incremental (think of 99% of static data). 5. I think we should change the subject with new patch revision, so that such functionality for incremental backups is not buried down in the pg_upgrade thread ;) -J.
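The merging of adjacent I/Os suggested above could be sketched as a pass that collapses the per-block offset map into maximal runs of consecutive blocks before issuing any pread/pwrite or copy_file_range call. This is only an illustrative structure — the real reconstruct_from_incremental_file would also have to track which backup file each block comes from:

```c
#include <stddef.h>
#include <sys/types.h>

#define BLCKSZ 8192

/*
 * One merged I/O request: nblocks consecutive BLCKSZ-sized blocks
 * starting at "offset" in a single source file.
 */
typedef struct
{
	off_t		offset;
	size_t		nblocks;
} IoRange;

/*
 * Collapse a per-block offset map into maximal runs of adjacent blocks.
 * Returns the number of ranges written to "out", which must have room
 * for nblocks entries in the worst (fully fragmented) case.
 */
static size_t
merge_adjacent_blocks(const off_t *offsetmap, size_t nblocks, IoRange *out)
{
	size_t		nranges = 0;

	for (size_t i = 0; i < nblocks; i++)
	{
		if (nranges > 0 &&
			out[nranges - 1].offset +
			(off_t) (out[nranges - 1].nblocks * BLCKSZ) == offsetmap[i])
			out[nranges - 1].nblocks++; /* extends the current run */
		else
		{
			out[nranges].offset = offsetmap[i];
			out[nranges].nblocks = 1;
			nranges++;
		}
	}
	return nranges;
}
```

With 99% static data, most of a segment would collapse into a handful of large ranges, turning thousands of 8kB syscalls into a few big ones.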
On 3/26/24 15:09, Jakub Wartak wrote: > On Sat, Mar 23, 2024 at 6:57 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > >> On 3/23/24 14:47, Tomas Vondra wrote: >>> On 3/23/24 13:38, Robert Haas wrote: >>>> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas.munro@gmail.com> wrote: > [..] >>> Yeah, that's in write_reconstructed_file() and the patch does not touch >>> that at all. I agree it would be nice to use copy_file_range() in this >>> part too, and it doesn't seem it'd be that hard to do, I think. >>> >>> It seems we'd just need a "fork" that either calls pread/pwrite or >>> copy_file_range, depending on checksums and what was requested. >>> >> >> Here's a patch to use copy_file_range in write_reconstructed_file too, >> when requested/possible. One thing that I'm not sure about is whether to >> do pg_fatal() if --copy-file-range but the platform does not support it. > [..] > > Hi Tomas, so I gave a go to the below patches today: > - v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch > - v20240323-2-0002-write_reconstructed_file.patch > > My assessment: > > v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch - > looks like more or less good to go There's some issues with the --dry-run, pointed out by Robert. Should be fixed in the attached version. 
> v20240323-2-0002-write_reconstructed_file.patch - needs work and > without that clone/copy_file_range() good effects are unlikely > > Given Debian 12, ~100GB DB, (pgbench -i -s 7000 , and some additional > tables with GiST and GIN indexes , just to see more WAL record types) > and with backups sizes in MB like that: > > 106831 full > 2823 incr.1 # captured after some time with pgbench -R 100 > 165 incr.2 # captured after some time with pgbench -R 100 > > Test cmd: rm -rf outtest; sync; sync ; sync; echo 3 | sudo tee > /proc/sys/vm/drop_caches ; time /usr/pgsql17/bin/pg_combinebackup -o > outtest full incr.1 incr.2 > > Test results of various copies on small I/O constrained XFS device: > normal copy: 31m47.407s > --clone copy: error: file cloning not supported on this platform (it's > due #ifdef of having COPY_FILE_RANGE available) > --copy-file-range: aborted, as it was taking too long , I was > expecting it to accelerate, but it did not... obviously this is the > transparent failover in case of calculating checksums... > --manifest-checksums=NONE --copy-file-range: BUG, it keep on appending > to just one file e.g. outtest/base/5/16427.29 with 200GB+ ?? and ended > up with ENOSPC [more on this later] That's really strange. > --manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 27m23.887s > --manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and > loop-fix: 5m1.986s but it creates corruption as it stands > Thanks. I plan to do more similar tests, once my machines get done with some other stuff. > Issues: > > 1. 
https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327 > compains about win32/mingw: > > [15:47:27.184] In file included from copy_file.c:22: > [15:47:27.184] copy_file.c: In function ‘copy_file’: > [15:47:27.184] ../../../src/include/common/logging.h:134:6: error: > this statement may fall through [-Werror=implicit-fallthrough=] > [15:47:27.184] 134 | if (unlikely(__pg_log_level <= PG_LOG_DEBUG)) \ > [15:47:27.184] | ^ > [15:47:27.184] copy_file.c:96:5: note: in expansion of macro ‘pg_log_debug’ > [15:47:27.184] 96 | pg_log_debug("would copy \"%s\" to \"%s\" > (copy_file_range)", > [15:47:27.184] | ^~~~~~~~~~~~ > [15:47:27.184] copy_file.c:99:4: note: here > [15:47:27.184] 99 | case COPY_MODE_COPYFILE: > [15:47:27.184] | ^~~~ > [15:47:27.184] cc1: all warnings being treated as errors > Yup, missing break. > 2. I do not know what's the consensus between --clone and > --copy-file-range , but if we have #ifdef FICLONE clone_works() #elif > HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also > apply the same logic to the --help so that --clone is not visible > there (for consistency?). Also the "error: file cloning not supported > on this platform " is technically incorrect, Linux does support > ioctl(FICLONE) and copy_file_range(), but we are just not choosing one > over another (so technically it is "available"). Nitpicking I know. > That's a good question, I'm not sure. But whatever we do, we should do the same thing in pg_upgrade. Maybe there's some sort of precedent? > 3. 
[v20240323-2-0002-write_reconstructed_file.patch]: The mentioned > ENOSPACE spiral-of-death-bug symptoms are like that: > > strace: > copy_file_range(8, [697671680], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697679872], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697688064], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697696256], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697704448], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697712640], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697720832], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697729024], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697737216], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697745408], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697753600], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697761792], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697769984], 9, NULL, 8192, 0) = 8192 > > Notice that dest_off_t (poutoff) is NULL. > > (gdb) where > #0 0x00007f2cd56f6733 in copy_file_range (infd=8, > pinoff=pinoff@entry=0x7f2cd53f54e8, outfd=outfd@entry=9, > poutoff=poutoff@entry=0x0, > length=length@entry=8192, flags=flags@entry=0) at > ../sysdeps/unix/sysv/linux/copy_file_range.c:28 > #1 0x00005555ecd077f4 in write_reconstructed_file > (copy_mode=COPY_MODE_COPY_FILE_RANGE, dry_run=false, debug=true, > checksum_ctx=0x7ffc4cdb7700, > offsetmap=<optimized out>, sourcemap=0x7f2cd54f6010, > block_length=<optimized out>, output_filename=0x7ffc4cdba910 > "outtest/base/5/16427.29", > input_filename=0x7ffc4cdba510 > "incr.2/base/5/INCREMENTAL.16427.29") at reconstruct.c:648 > #2 reconstruct_from_incremental_file > (input_filename=input_filename@entry=0x7ffc4cdba510 > "incr.2/base/5/INCREMENTAL.16427.29", > output_filename=output_filename@entry=0x7ffc4cdba910 > "outtest/base/5/16427.29", > relative_path=relative_path@entry=0x7ffc4cdbc670 "base/5", > bare_file_name=bare_file_name@entry=0x5555ee2056ef "16427.29", > n_prior_backups=n_prior_backups@entry=2, > 
prior_backup_dirs=prior_backup_dirs@entry=0x7ffc4cdbf248, > manifests=0x5555ee137a10, manifest_path=0x7ffc4cdbad10 > "base/5/16427.29", > checksum_type=CHECKSUM_TYPE_NONE, checksum_length=0x7ffc4cdb9864, > checksum_payload=0x7ffc4cdb9868, debug=true, dry_run=false, > copy_method=COPY_MODE_COPY_FILE_RANGE) at reconstruct.c:327 > > .. it's a spiral of death till ENOSPC. Reverting the > v20240323-2-0002-write_reconstructed_file.patch helps. The problem > lies in that do-wb=-inifity-loop (?) along with NULL for destination > off_t. This seem to solves that thingy(?): > > - do > - { > - wb = copy_file_range(s->fd, > &offsetmap[i], wfd, NULL, BLCKSZ, 0); > + //do > + //{ > + wb = copy_file_range(s->fd, > &offsetmap[i], wfd, &offsetmap[i], BLCKSZ, 0); > if (wb < 0) > pg_fatal("error while copying > file range from \"%s\" to \"%s\": %m", > > input_filename, output_filename); > - } while (wb > 0); > + //} while (wb > 0); > #else > > ...so that way I've got it down to 5mins. > Yeah, that retry logic is wrong. I ended up copying the check from the "regular" copy branch, which simply bails out if copy_file_range returns anything but the expected 8192. I wonder if this should deal with partial writes, though. I mean, it's allowed copy_file_range() copies only some of the bytes - I don't know how often / in what situations that happens, though ... And if we want to handle that for copy_file_range(), pwrite() needs the same treatment. > 3. .. but onn startup I've got this after trying psql login: invalid > page in block 0 of relation base/5/1259 . I've again reverted the > v20240323-2-0002 to see if that helped for next-round of > pg_combinebackup --manifest-checksums=NONE --copy-file-range and after > 32mins of waiting it did help indeed: I was able to login and select > counts worked and matched properly the data. I've reapplied the > v20240323-2-0002 (with my fix to prevent that endless loop) and the > issue was again(!) there. 
Probably it's related to the destination > offset. I couldn't find more time to look on it today and the setup > was big 100GB on slow device, so just letting You know as fast as > possible. > Can you see if you can still reproduce this with the attached version? > 4. More efficiency is on the table option (optional patch node ; just > for completeness; I dont think we have time for that? ): even if > v20240323-2-0002 would work, the problem is that it would be sending > syscall for every 8kB. We seem to be performing lots of per-8KB > syscalls which hinder performance (both in copy_file_range and in > normal copy): > > pread64(8, ""..., 8192, 369115136) = 8192 // 369115136 + 8192 = > 369123328 (matches next pread offset) > write(9, ""..., 8192) = 8192 > pread64(8, ""..., 8192, 369123328) = 8192 // 369123328 + 8192 = 369131520 > write(9, ""..., 8192) = 8192 > pread64(8, ""..., 8192, 369131520) = 8192 // and so on > write(9, ""..., 8192) = 8192 > > Apparently there's no merging of adjacent IO/s, so pg_combinebackup > wastes lots of time on issuing instead small syscalls but it could > let's say do single pread/write (or even copy_file_range()). I think > it was not evident in my earlier testing (200GB; 39min vs ~40s) as I > had much smaller modifications in my incremental (think of 99% of > static data). > Yes, I've been thinking about exactly this optimization, but I think we're way past proposing this for PG17. The changes that would require in reconstruct_from_incremental_file are way too significant. Has to wait for PG18 ;-) I do think there's more on the table, as mentioned by Thomas a couple days ago - maybe we shouldn't approach clone/copy_file_range merely as an optimization to save time, it might be entirely reasonable to do this simply to allow the filesystem to do CoW magic and save space (even if we need to read the data and recalculate the checksum, which now disables these copy methods). > 5. 
I think we should change the subject with new patch revision, so > that such functionality for incremental backups is not buried down in > the pg_upgrade thread ;) > OK. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
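The partial-write handling wondered about above (for both pwrite and copy_file_range) could look something like the following sketch. It is shown with pread/pwrite, which is portable enough to demonstrate; a copy_file_range() variant would have the same retry shape, with the kernel advancing the input offset and a short return simply meaning "call again for the remainder":

```c
#include <errno.h>
#include <stdbool.h>
#include <unistd.h>
#include <sys/types.h>

/*
 * Copy "len" bytes from src_fd at src_off to dst_fd at dst_off,
 * retrying on short reads and short writes instead of treating them
 * as errors.  Returns false on I/O error or unexpected EOF.
 */
static bool
copy_range_retry(int src_fd, off_t src_off, int dst_fd, off_t dst_off,
				 size_t len)
{
	char		buf[8192];

	while (len > 0)
	{
		size_t		chunk = len < sizeof(buf) ? len : sizeof(buf);
		ssize_t		nread = pread(src_fd, buf, chunk, src_off);

		if (nread <= 0)
			return false;		/* error, or EOF before "len" bytes */

		/* a short pwrite is not an error: write out the rest too */
		ssize_t		off = 0;

		while (off < nread)
		{
			ssize_t		nwritten = pwrite(dst_fd, buf + off,
										  nread - off, dst_off + off);

			if (nwritten < 0)
			{
				if (errno == EINTR)
					continue;
				return false;
			}
			off += nwritten;
		}
		src_off += nread;
		dst_off += nread;
		len -= nread;
	}
	return true;
}
```

Whether short copy_file_range() returns actually occur often in practice is the open question from the discussion; the point of the sketch is only that bailing out on anything other than a full BLCKSZ transfer is stricter than POSIX-style I/O requires.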
Attachments
On Tue, Mar 26, 2024 at 7:03 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: [..] > > That's really strange. Hi Tomas, but it looks like it's fixed now :) > > --manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 27m23.887s > > --manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and > > loop-fix: 5m1.986s but it creates corruption as it stands > > > > Thanks. I plan to do more similar tests, once my machines get done with > some other stuff. Please do so as I do not trust my fingers :-) > > Issues: > > > > 1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327 > > compains about win32/mingw: > > [..] > > > > Yup, missing break. Now it's https://cirrus-ci.com/task/4997185324974080?logs=headers_headerscheck#L10 , reproducible with "make -s headerscheck EXTRAFLAGS='-fmax-errors=10'": /tmp/postgres/src/bin/pg_combinebackup/reconstruct.h:34:91: error: unknown type name ‘CopyMode’ / CopyMode copy_mode); to me it looks like reconstruct.h needs to include definition of CopyMode which is in "#include "copy_file.h" > > 2. I do not know what's the consensus between --clone and > > --copy-file-range , but if we have #ifdef FICLONE clone_works() #elif > > HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also > > apply the same logic to the --help so that --clone is not visible > > there (for consistency?). Also the "error: file cloning not supported > > on this platform " is technically incorrect, Linux does support > > ioctl(FICLONE) and copy_file_range(), but we are just not choosing one > > over another (so technically it is "available"). Nitpicking I know. > > > > That's a good question, I'm not sure. But whatever we do, we should do > the same thing in pg_upgrade. Maybe there's some sort of precedent? Sigh, you are right... It's consistent hell. > > 3. [v20240323-2-0002-write_reconstructed_file.patch]: The mentioned > > ENOSPACE spiral-of-death-bug symptoms are like that: [..] 
> > Yeah, that retry logic is wrong. I ended up copying the check from the > "regular" copy branch, which simply bails out if copy_file_range returns > anything but the expected 8192. > > I wonder if this should deal with partial writes, though. I mean, it's > allowed copy_file_range() copies only some of the bytes - I don't know > how often / in what situations that happens, though ... And if we want > to handle that for copy_file_range(), pwrite() needs the same treatment. Maybe that helps? https://github.com/coreutils/coreutils/blob/606f54d157c3d9d558bdbe41da8d108993d86aeb/src/copy.c#L1427 , it's harder than I anticipated (we can ignore the sparse logic though, I think) > > 3. .. but onn startup I've got this after trying psql login: invalid > > page in block 0 of relation base/5/1259 . [..] > > Can you see if you can still reproduce this with the attached version? Looks like it's fixed now and it works great (~3min, multiple times)! BTW: I've tried to also try it over NFSv4 over loopback with XFS as copy_file_range() does server-side optimization probably, but somehow it was so slow there that's it is close to being unusable (~9GB out of 104GB reconstructed after 45mins) - maybe it's due to NFS mount opts, i don't think we should worry too much. I think it's related to missing the below optimization if that matters. I think it's too early to warn users about NFS (I've spent on it just 10 mins), but on the other hand people might complain it's broken... > > Apparently there's no merging of adjacent IO/s, so pg_combinebackup > > wastes lots of time on issuing instead small syscalls but it could > > let's say do single pread/write (or even copy_file_range()). I think > > it was not evident in my earlier testing (200GB; 39min vs ~40s) as I > > had much smaller modifications in my incremental (think of 99% of > > static data). > > > > Yes, I've been thinking about exactly this optimization, but I think > we're way past proposing this for PG17. 
The changes that would require > in reconstruct_from_incremental_file are way too significant. Has to > wait for PG18 ;-) Sure thing! > I do think there's more on the table, as mentioned by Thomas a couple > days ago - maybe we shouldn't approach clone/copy_file_range merely as > an optimization to save time, it might be entirely reasonable to do this > simply to allow the filesystem to do CoW magic and save space (even if > we need to read the data and recalculate the checksum, which now > disables these copy methods). Sure ! I think time will still be a priority though, as pg_combinebackup duration impacts RTO while disk space is relatively cheap. One could argue that reconstructing 50TB will be a challenge though. Now my tests indicate space saving is already happening with 0002 patch - 100GB DB / full backup stats look like that (so we are good I think when using CoW - not so without using CoW) -- or i misunderstood something?: root@jw-test-1:/backups# du -sm /backups/ 214612 /backups/ root@jw-test-1:/backups# du -sm * 106831 full 2823 incr.1 165 incr.2 104794 outtest root@jw-test-1:/backups# df -h . # note this double confirms that just 114GB is used (XFS), great! 
Filesystem Size Used Avail Use% Mounted on /dev/sdb1 500G 114G 387G 23% /backups root@jw-test-1:/backups# # https://github.com/pwaller/sharedextents root@jw-test-1:/backups# ./sharedextents-linux-amd64 full/base/5/16427.68 outtest/base/5/16427.68 1056915456 / 1073741824 bytes (98.43%) # extents reuse Now I was wondering a little bit if the huge XFS extent allocation won't hurt read performance (probably they were created due many independent copy_file_range() calls): root@jw-test-1:/backups# filefrag full/base/5/16427.68 full/base/5/16427.68: 1 extent found root@jw-test-1:/backups# filefrag outtest/base/5/16427.68 outtest/base/5/16427.68: 3979 extents found However in first look on seq reads of such CoW file it's still good (I'm assuming such backup after reconstruction would be copied back to the proper DB server from this backup server): root@jw-test-1:/backups# echo 3 > /proc/sys/vm/drop_caches root@jw-test-1:/backups# time cat outtest/base/5/16427.68 > /dev/null real 0m4.286s root@jw-test-1:/backups# echo 3 > /proc/sys/vm/drop_caches root@jw-test-1:/backups# time cat full/base/5/16427.68 > /dev/null real 0m4.325s Now Thomas wrote there "then it might make sense to do that even if you *also* have to read the data to compute the checksums, I think? " ... sounds like read(), checksum() and still do copy_file_range() instead of pwrite? PG 18 ? :D -J.
On Tue, Mar 26, 2024 at 2:09 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > The patch I shared a couple minutes ago should fix this, effectively > restoring the original debug behavior. I liked the approach with calling > strategy_implementation a bit more, I wonder if it'd be better to go > back to that for the "accelerated" copy methods, somehow. Somehow I don't see this patch? -- Robert Haas EDB: http://www.enterprisedb.com
On 3/28/24 21:45, Robert Haas wrote: > On Tue, Mar 26, 2024 at 2:09 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> The patch I shared a couple minutes ago should fix this, effectively >> restoring the original debug behavior. I liked the approach with calling >> strategy_implementation a bit more, I wonder if it'd be better to go >> back to that for the "accelerated" copy methods, somehow. > > Somehow I don't see this patch? > It's here: https://www.postgresql.org/message-id/90866c27-265a-4adb-89d0-18c8dd22bc19%40enterprisedb.com I did change the subject to reflect that it's no longer about pg_upgrade, maybe that breaks the threading for you somehow? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Mar 26, 2024 at 2:03 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > [ new patches ] Tomas, thanks for pointing me to this email; as you speculated, gmail breaks threading if the subject line changes. The documentation still needs work here: - It refers to --link mode, which is not a thing. - It should talk about the fact that in some cases block-by-block copying will still be needed, possibly mentioning that it specifically happens when the old backup manifest is not available or does not contain checksums or does not contain checksums of the right type, or maybe just being a bit vague. In copy_file.c: - You added an unnecessary blank line to the beginning of a comment block. - You could keep the strategy_implementation variable here. I don't think that's 100% necessary, but now that you've got the code structured this way, there's no compelling reason to remove it. - I don't know what the +/* XXX maybe this should do the check internally, same as the other functions? */ comment is talking about. - Maybe these functions should have header comments. -- Robert Haas EDB: http://www.enterprisedb.com
On 3/29/24 15:23, Robert Haas wrote:
> On Tue, Mar 26, 2024 at 2:03 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> [ new patches ]
>
> Tomas, thanks for pointing me to this email; as you speculated, gmail
> breaks threading if the subject line changes.
>
> The documentation still needs work here:
>
> - It refers to --link mode, which is not a thing.
>
> - It should talk about the fact that in some cases block-by-block
> copying will still be needed, possibly mentioning that it specifically
> happens when the old backup manifest is not available or does not
> contain checksums or does not contain checksums of the right type, or
> maybe just being a bit vague.
>
> In copy_file.c:
>
> - You added an unnecessary blank line to the beginning of a comment block.
>

Thanks, should be all cleaned up now, I think.

> - You could keep the strategy_implementation variable here. I don't
> think that's 100% necessary, but now that you've got the code
> structured this way, there's no compelling reason to remove it.
>

Yeah, I think you're right. The strategy_implementation seemed a bit
weird to me because we now have 4 functions with different signatures.
Most only take src/dst, but copy_file_blocks() also takes a checksum
context. And it seemed better to handle everything the same way, rather
than treating copy_file_blocks as an exception. But it's not that bad,
so 0001 has strategy_implementation again. But I'll get back to this in
a minute.

> - I don't know what the +/* XXX maybe this should do the check
> internally, same as the other functions? */ comment is talking about.
>

I think this is stale. The XXX is about how the various functions
detect/report support. In most of them we have the ifdefs/pg_fatal()
inside the function, but CopyFile() has nothing like that, because the
detection happens earlier. I wasn't sure if maybe we should make all
these functions more alike, but I don't think we should.

> - Maybe these functions should have header comments.
>

Right, added.
I was thinking about the comment [1] from a couple of days ago, where
Thomas suggested that maybe we should try doing the CoW stuff
(clone/copy_file_range) even in cases when we need to read the block,
say to calculate a checksum, or even to reconstruct from incremental
backups.

I wasn't sure how big the benefits of the patches shared so far might
be, so I decided to do some tests. I did a fairly simple thing:

1) initialize a cluster with pgbench scale 5000 (~75GB)

2) create a full backup

3) do a run that updates ~1%, 10% and 20% of the blocks

4) create an incremental backup after each run

5) run pg_combinebackup for each of the increments, with block-by-block
copy and copy_file_range, and measure how long it takes and how much
disk space it consumes

I did this on xfs and btrfs, and it quickly became obvious that there's
very little benefit unless --no-manifest is used. Which makes perfect
sense, because pgbench does uniform updates, so all segments need to be
reconstructed from increments (so copy_file.c is mostly irrelevant), and
write_reconstructed_file only uses copy_file_range() without checksums.

I don't know how common --no-manifest is going to be, but I guess people
will want to keep manifests in at least some backup schemes (e.g. to
rebuild a full backup instead of having to take a full backup regularly).

So I decided to take a stab at Thomas' idea, i.e. reading the data to
calculate checksums, but then using copy_file_range instead of just
writing the data onto disk. This is in 0003, which relaxes the
conditions in 0002 shared a couple days ago. And this helped a lot.

The attached PDF shows results for xfs/btrfs. Charts on the left are
disk space occupied by the reconstructed backup, measured as the
difference in "df" before and after running pg_combinebackup. The
duration of the pg_combinebackup execution is on the right. The first
row is without a manifest (i.e. --no-manifest), the second row is with
a manifest.
The 1%, 10% and 20% groups are for the various increments, updating
different fractions of the database.

The database is ~75GB, so that's what we expect a plain copy to occupy.
If there are some CoW benefits of copy_file_range, allowing the fs to
reuse some of the space, the disk space used should be reduced. And
similarly, there could/should be some improvements in pg_combinebackup
duration.

Each bar is a different copy method and patch:

* copy on master/0001/0002/0003 - we don't expect any difference between
these, it should all perform the same and use the "full" space

* copy_file_range on 0001/0002/0003 - 0001 should perform the same as
copy, because it's only about full-segment copies, and we don't have any
of those here; 0002/0003 should help, depending on --no-manifest

And indeed, this is what we see. 0002/0003 use only a fraction of the
disk space, roughly the same as the updated fraction (which matches the
size of the increment). This is nice.

For duration, the benefits seem to depend on the file system. For btrfs
it actually is faster, as expected. 0002/0003 save maybe 30-50% of the
time, compared to block-by-block copy. On XFS it's not that great - the
copy_file_range is actually slower, by up to about 50%. And this is not
about the extra read - it affects the 0002/no-manifest case too, where
the read is not necessary.

I think this is fine. It's a tradeoff, where on some filesystems you can
save time or space, and on others you can save both. That's a tradeoff
for the users to decide, I think. I'll see how this works on EXT4/ZFS
next ...

But thinking about this a bit more, I realized there's no reason not to
apply the same logic to the copy_file part. I mean, if we need to copy a
file but also calculate a checksum, we can simply do the clone using
clone/copy_file_range, but then also read the file and calculate the
checksum ...
0004 does this, by simply passing the checksum_cxt around, which also
has the nice consequence that all the functions now have the same
signature, which makes the strategy_implementation work for all cases.
I need to do more testing of this, but I like how this looks. Of
course, maybe there's no agreement that this is the right way to
approach this, and we should do the regular block-by-block copy?

There's one more change in 0003 - the checks for whether
clone/copy_file_range are supported by the platform now happen right at
the beginning, when parsing the arguments, so that when a user specifies
one of those options, the error happens right away instead of sometime
later when we happen to hit one of those pg_fatal() places. I think this
is the right place to do these checks, as it makes
write_reconstructed_file much easier to understand (without all the
ifdefs etc.). But there's an argument whether this should fail with
pg_fatal() or just fall back to the default copy method.

BTW I wonder if it makes sense to only allow one of those methods? At
the moment the user can specify both --clone and --copy-file-range, and
which one "wins" depends on the order in which they are specified. Seems
confusing at best. But maybe it'd make sense to allow both, and e.g. use
clone() to copy whole segments and copy_file_range() for other places?

regards

[1]
https://www.postgresql.org/message-id/CA%2BhUKG%2B8KDk%2BpM6vZHWT6XtZzh-sdieUDohcjj0fia6aqK3Oxg%40mail.gmail.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
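The "same signature" point can be pictured as a single function-pointer
type dispatching to every copy method. The sketch below uses invented
names (checksum_ctx, copy_strategy, choose_strategy and the stub bodies
are all hypothetical; the patch's actual declarations differ):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical checksum context; stands in for pg_checksum_context. */
typedef struct checksum_ctx { uint64_t dummy; } checksum_ctx;

/*
 * One signature for all copy methods: plain copy, clone,
 * copy_file_range, and block-by-block copy with checksums.
 * Because they all match, one pointer can select any of them.
 */
typedef void (*copy_strategy) (const char *src, const char *dst,
							   checksum_ctx *checksum);

static void
copy_blocks(const char *src, const char *dst, checksum_ctx *c)
{	/* pread/pwrite loop, feeding the checksum, would go here */
	(void) src; (void) dst; (void) c;
}

static void
copy_by_range(const char *src, const char *dst, checksum_ctx *c)
{	/* copy_file_range loop; checksum computed from a separate read */
	(void) src; (void) dst; (void) c;
}

/* Pick one implementation up front, then call it for every file. */
static copy_strategy
choose_strategy(int use_copy_file_range)
{
	return use_copy_file_range ? copy_by_range : copy_blocks;
}
```

The design payoff is that the per-file loop never branches on the copy
method; it just calls whatever pointer was chosen during option parsing.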
Attachments
On Sun, Mar 31, 2024 at 1:37 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > So I decided to take a stab at Thomas' idea, i.e. reading the data to > ... > I'll see how this works on EXT4/ZFS next ... Wow, very cool! A couple of very quick thoughts/notes: ZFS: the open source version only gained per-file block cloning in 2.2, so if you're on an older release I expect copy_file_range() to work but not really do the magic thing. On the FreeBSD version you also have to turn cloning on with a sysctl because people were worried about bugs in early versions so by default you still get actual copying, not sure if you need something like that on the Linux version... (Obviously ZFS is always completely COW-based, but before the new block cloning stuff it could only share blocks by its own magic deduplication if enabled, or by cloning/snapshotting a whole dataset/mountpoint; there wasn't a way to control it explicitly like this.) Alignment: block sharing on any fs requires it. I haven't re-checked recently but IIRC the incremental file format might have a non-block-sized header? That means that if you copy_file_range() from both the older backup and also the incremental backup, only the former will share blocks, and the latter will silently be done by copying to newly allocated blocks. If that's still true, I'm not sure how hard it would be to tweak the format to inject some padding and to make sure that there isn't any extra header before each block.
+ wb = copy_file_range(s->fd, &offsetmap[i], wfd, NULL, BLCKSZ, 0); Can you collect adjacent blocks in one multi-block call? And then I think the contract is that you need to loop if it returns short.
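Both points - batching adjacent blocks into one call and looping on
short returns - could be handled by a helper along these lines. This is
a sketch, not the patch's code; copy_range_full and its error handling
are invented, and a contiguous run of nblocks would be copied as
copy_range_full(srcfd, &off, dstfd, nblocks * BLCKSZ):

```c
#define _GNU_SOURCE				/* for copy_file_range() on glibc */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

/*
 * Copy len bytes from srcfd at *srcoff to dstfd at its current file
 * offset.  copy_file_range() may copy fewer bytes than requested, so
 * loop until the whole range is done; the kernel advances *srcoff.
 */
static int
copy_range_full(int srcfd, off_t *srcoff, int dstfd, size_t len)
{
	while (len > 0)
	{
		ssize_t		wb = copy_file_range(srcfd, srcoff, dstfd, NULL, len, 0);

		if (wb < 0)
			return -1;			/* caller checks errno (EXDEV, ENOSYS, ...) */
		if (wb == 0)
		{
			errno = EIO;		/* source file shorter than expected */
			return -1;
		}
		len -= (size_t) wb;
	}
	return 0;
}
```

Note the call can still fail outright on kernels/filesystems without
support, so a caller would likely want a read/write fallback path.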
On 3/31/24 03:03, Thomas Munro wrote: > On Sun, Mar 31, 2024 at 1:37 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> So I decided to take a stab at Thomas' idea, i.e. reading the data to >> ... >> I'll see how this works on EXT4/ZFS next ... > > Wow, very cool! A couple of very quick thoughts/notes: > > ZFS: the open source version only gained per-file block cloning in > 2.2, so if you're on an older release I expect copy_file_range() to > work but not really do the magic thing. On the FreeBSD version you > also have to turn cloning on with a sysctl because people were worried > about bugs in early versions so by default you still get actual > copying, not sure if you need something like that on the Linux > version... (Obviously ZFS is always completely COW-based, but before > the new block cloning stuff it could only share blocks by its own > magic deduplication if enabled, or by cloning/snapshotting a whole > dataset/mountpoint; there wasn't a way to control it explicitly like > this.) > I'm on 2.2.2 (on Linux). But there's something wrong, because the pg_combinebackup that took ~150s on xfs/btrfs, takes ~900s on ZFS. I'm not sure it's a ZFS config issue, though, because it's not CPU or I/O bound, and I see this on both machines. And some simple dd tests show the zpool can do 10x the throughput. Could this be due to the file header / pool alignment? > Alignment: block sharing on any fs requires it. I haven't re-checked > recently but IIRC the incremental file format might have a > non-block-sized header? That means that if you copy_file_range() from > both the older backup and also the incremental backup, only the former > will share blocks, and the latter will silently be done by copying to > newly allocated blocks. If that's still true, I'm not sure how hard > it would be to tweak the format to inject some padding and to make > sure that there isn't any extra header before each block. 
I admit I'm not very familiar with the format, but you're probably right
there's a header, and header_length does not seem to consider alignment.
make_incremental_rfile simply does this:

    /* Remember length of header. */
    rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
        sizeof(rf->truncation_block_length) +
        sizeof(BlockNumber) * rf->num_blocks;

and sendFile() does the same thing when creating an incremental
basebackup. I guess it wouldn't be too difficult to make sure to align
this to BLCKSZ or something like this. I wonder if the file format is
documented somewhere ... It'd certainly be nicer to tweak it before v18,
if necessary.

Anyway, is that really a problem? I mean, in my tests the CoW stuff
seemed to work quite fine - at least on XFS/BTRFS. Although, maybe
that's why it took longer on XFS ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
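Aligning such a header comes down to rounding its length up to the next
BLCKSZ boundary and padding the gap with \0 bytes. A sketch of the
rounding arithmetic (the helper name is invented, not from any patch):

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192				/* assumes BLCKSZ is a power of two */

/*
 * Round the variable-length incremental-file header up to the next
 * BLCKSZ boundary, so the block data that follows starts block-aligned.
 */
static inline uint64_t
align_header(uint64_t raw_header_len)
{
	return (raw_header_len + BLCKSZ - 1) & ~((uint64_t) (BLCKSZ - 1));
}
```

The difference between the aligned and raw lengths is the number of \0
padding bytes the writer would emit after the block map.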
On Sun, Mar 31, 2024 at 5:33 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > I'm on 2.2.2 (on Linux). But there's something wrong, because the > pg_combinebackup that took ~150s on xfs/btrfs, takes ~900s on ZFS. > > I'm not sure it's a ZFS config issue, though, because it's not CPU or > I/O bound, and I see this on both machines. And some simple dd tests > show the zpool can do 10x the throughput. Could this be due to the file > header / pool alignment? Could ZFS recordsize > 8kB be making it worse, repeatedly dealing with the same 128kB record as you copy_file_range 16 x 8kB blocks? (Guessing you might be using the default recordsize?) > I admit I'm not very familiar with the format, but you're probably right > there's a header, and header_length does not seem to consider alignment. > make_incremental_rfile simply does this: > > /* Remember length of header. */ > rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) + > sizeof(rf->truncation_block_length) + > sizeof(BlockNumber) * rf->num_blocks; > > and sendFile() does the same thing when creating incremental basebackup. > I guess it wouldn't be too difficult to make sure to align this to > BLCKSZ or something like this. I wonder if the file format is documented > somewhere ... It'd certainly be nicer to tweak before v18, if necessary. > > Anyway, is that really a problem? I mean, in my tests the CoW stuff > seemed to work quite fine - at least on the XFS/BTRFS. Although, maybe > that's why it took longer on XFS ... Yeah I'm not sure, I assume it did more allocating and copying because of that. It doesn't matter and it would be fine if a first version weren't as good as possible, and fine if we tune the format later once we know more, ie leaving improvements on the table. I just wanted to share the observation. I wouldn't be surprised if the block-at-a-time coding makes it slower and maybe makes the on disk data structures worse, but I dunno I'm just guessing. 
It's also interesting but not required to figure out how to tune ZFS well for this purpose right now...
On 3/31/24 06:46, Thomas Munro wrote: > On Sun, Mar 31, 2024 at 5:33 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> I'm on 2.2.2 (on Linux). But there's something wrong, because the >> pg_combinebackup that took ~150s on xfs/btrfs, takes ~900s on ZFS. >> >> I'm not sure it's a ZFS config issue, though, because it's not CPU or >> I/O bound, and I see this on both machines. And some simple dd tests >> show the zpool can do 10x the throughput. Could this be due to the file >> header / pool alignment? > > Could ZFS recordsize > 8kB be making it worse, repeatedly dealing with > the same 128kB record as you copy_file_range 16 x 8kB blocks? > (Guessing you might be using the default recordsize?) > No, I reduced the record size to 8kB. And the pgbench init takes about the same as on other filesystems on this hardware, I think. ~10 minutes for scale 5000. >> I admit I'm not very familiar with the format, but you're probably right >> there's a header, and header_length does not seem to consider alignment. >> make_incremental_rfile simply does this: >> >> /* Remember length of header. */ >> rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) + >> sizeof(rf->truncation_block_length) + >> sizeof(BlockNumber) * rf->num_blocks; >> >> and sendFile() does the same thing when creating incremental basebackup. >> I guess it wouldn't be too difficult to make sure to align this to >> BLCKSZ or something like this. I wonder if the file format is documented >> somewhere ... It'd certainly be nicer to tweak before v18, if necessary. >> >> Anyway, is that really a problem? I mean, in my tests the CoW stuff >> seemed to work quite fine - at least on the XFS/BTRFS. Although, maybe >> that's why it took longer on XFS ... > > Yeah I'm not sure, I assume it did more allocating and copying because > of that. 
It doesn't matter and it would be fine if a first version > weren't as good as possible, and fine if we tune the format later once > we know more, ie leaving improvements on the table. I just wanted to > share the observation. I wouldn't be surprised if the block-at-a-time > coding makes it slower and maybe makes the on disk data structures > worse, but I dunno I'm just guessing. > > It's also interesting but not required to figure out how to tune ZFS > well for this purpose right now... No idea. Any idea if there's some good ZFS statistics to check? -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
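For what it's worth, OpenZFS on Linux does expose prefetch counters via
the ARC kstats, and a few module/dataset settings are relevant to this
workload. A few inspection commands one might try (paths assume ZFS on
Linux; the dataset name "tank/backups" is a placeholder):

```shell
# prefetch hit/miss counters in the ARC stats
grep prefetch /proc/spl/kstat/zfs/arcstats

# whether prefetch is disabled globally (1 = disabled)
cat /sys/module/zfs/parameters/zfs_prefetch_disable

# per-dataset properties that affect this workload
zfs get recordsize,primarycache,compression tank/backups
```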
Hi,

I've been running some benchmarks and experimenting with various stuff,
trying to improve the poor performance on ZFS, and the regression on XFS
when using copy_file_range. And oh boy, did I find interesting stuff ...

Attached is a PDF with results of my benchmark for ZFS/XFS/BTRFS, on my
two machines. I already briefly described what the benchmark does, but
to clarify:

1) generate backups: initialize pgbench scale 5000, do a full backup,
update roughly 1%, 10% and 20% of blocks and do an incremental backup
after each of those steps

2) combine backups: full + 1%, full + 1% + 10%, full + 1% + 10% + 20%

3) measure how long it takes and how much more disk space is used (to
see how well the CoW stuff works)

4) after each pg_combinebackup, run pg_verifybackup, start the cluster
to finish recovery, and run pg_checksums --check (to check the patches
don't produce something broken)

There's a lot of interesting stuff to discuss, some of which was already
mentioned in this thread earlier - in particular, I want to talk about
block alignment, prefetching and processing larger chunks of blocks.

Attached are also all the patches including the ugly WIP parts discussed
later, complete results if you want to do your own analysis, and the
scripts used to generate/restore the backups.

FWIW I'm not claiming the patches are commit-ready (especially the new
WIP parts), but they should be correct and good enough for discussion
(that applies especially to 0007). I think I could get them ready in a
day or two, but I'd like some feedback on my findings, and also if
someone would have objections to getting this in so shortly before the
feature freeze, I'd prefer to know about that.

The patches are numbered the same as in the benchmark results, i.e. 0001
is "1", 0002 is "2" etc. The "0-baseline" option is current master
without any patches.

Now to the findings ...
1) block alignment
------------------

This was mentioned by Thomas a couple days ago, when he pointed out that
the incremental files have a variable-length header (to record which
blocks are stored in the file), followed by the block data, which means
the block data is not aligned to fs blocks. I hadn't realized this, I
just used whatever the reconstruction function received, but Thomas
pointed out this may interfere with CoW, which needs the blocks to be
aligned.

And I think he's right, and my tests confirm this. I did a trivial patch
to align the blocks to an 8K boundary, by forcing the header to be a
multiple of 8K (I think 4K alignment would be enough). See the 0001
patch that does this.

And if I measure the disk space used by pg_combinebackup, and compare
the results with results without the patch ([1] from a couple days
back), I see this:

   pct      not aligned      aligned
   -------------------------------------
    1%         689M            19M
   10%        3172M            22M
   20%       13797M            27M

Yes, those numbers are correct. I didn't believe this at first, but the
backups are valid/verified, checksums are OK, etc. BTRFS has similar
numbers (e.g. a drop from 20GB to 600MB).

If you look at the charts in the PDF, charts for on-disk space are on
the right side. It might seem like copy_file_range/CoW has no impact,
but that's just an illusion - the bars for the last three cases are so
small it's difficult to see them (especially on XFS). While this does
not show the impact of alignment (because all of the cases in these runs
have blocks aligned), it shows how tiny the backups can be made. But
alignment does have significant impact, per the preceding paragraph.

This also affects the prefetching, which I'm going to talk about next.
Having the blocks misaligned (spanning multiple 4K pages) forces the
system to prefetch more pages than necessary. I don't know how big the
impact is, because the prefetch patch is 0002, so I only have results
for prefetching on aligned blocks, but I don't see how it could not have
a cost.
I do think we should just align the blocks properly. The 0001 patch does
that simply by adding a bunch of \0 bytes up to the next 8K boundary.
Yes, this has a cost - if you have tiny files with only one or two
blocks changed, the increment file will be a bit larger. Files without
any blocks don't need alignment/padding, and as the number of blocks
increases, it gets negligible pretty quickly. Also, files use a multiple
of fs blocks anyway, so if we align to 4K blocks it wouldn't actually
need more space at all. And even if it does, it's all \0, so pretty damn
compressible (and I'm sorry, but if you care about tiny amounts of data
added by alignment, but refuse to use compression ...).

I think we absolutely need to align the blocks in the incremental files,
and I think we should do that now. I think 8K would work, but maybe we
should add an alignment parameter to basebackup & manifest?

The reason why I think maybe this should be a basebackup parameter is
the recent discussion about large fs blocks - it seems to be in the
works, so maybe better to be ready and not assume all fs have 4K.

And I think we probably want to do this now, because this affects all
tools dealing with incremental backups - even if someone writes a custom
version of pg_combinebackup, it will have to deal with misaligned data.
Perhaps there might be something like pg_basebackup that "transforms"
the data received from the server (and also the backup manifest), but
that does not seem like a great direction.

Note: Of course, these space savings only exist thanks to sharing blocks
with the input backups, because the blocks in the combined backup point
to one of the other backups. If those old backups are removed, then the
"saved space" disappears, because there's only a single copy.

2) prefetch
-----------

I was very puzzled by the awful performance on ZFS. When every other fs
(EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took
900-1000 seconds on ZFS, no matter what I did.
I tried all the tuning advice I could think of, with almost no effect.

Ultimately I decided that it probably is the "no readahead" behavior
I've observed on ZFS. I assume it's because it doesn't use the page
cache where the regular readahead is detected etc. And there's no
prefetching in pg_combinebackup, so I decided to do an experiment and
added a trivial explicit prefetch when reconstructing the file - every
time we'd read data from a file, we do posix_fadvise for up to 128
blocks ahead (similar to what the bitmap heap scan code does). See 0002.

And tadaaa - the duration dropped from 900-1000 seconds to only about
250-300 seconds, so an improvement of a factor of 3-4x. I think this is
pretty massive.

There are a couple more interesting ZFS details - the prefetching seems
to be necessary even when using copy_file_range(), where we don't need
to read the data (to calculate checksums). This is why the
"manifest=off" chart has the strange group of high bars at the end - the
copy cases are fast because prefetch happens, but if we switch to
copy_file_range() there are no prefetches and it gets slow.

This is a bit bizarre, especially because the manifest=on cases are
still fast, exactly because the pread + prefetching still happens. I'm
sure users would find this puzzling.

Unfortunately, the prefetching is not beneficial for all filesystems.
For XFS it does not seem to make any difference, but on BTRFS it seems
to cause a regression.

I think this means we may need a "--prefetch" option, that'd force
prefetching, probably both before pread and copy_file_range. Otherwise
people on ZFS are doomed and will have poor performance.

3) bulk operations
------------------

Another thing suggested by Thomas last week was that maybe we should try
detecting longer runs of blocks coming from the same file, and operate
on them as a single chunk of data. If you see e.g.
32 blocks, instead of doing read/write or copy_file_range for each of
them, we could simply do one call for all those blocks at once. I think
such runs are pretty likely, especially for small incremental backups
where most of the blocks will come from the full backup.

And I was suspecting the XFS regression (where copy_file_range was up to
30-50% slower in some cases, see [1]) is related to this, because the
perf profiles had stuff like this:

  97.28%  2.10%  pg_combinebacku  [kernel.vmlinux]  [k]
          |
          |--95.18%--entry_SYSCALL_64
          |          |
          |           --94.99%--do_syscall_64
          |                      |
          |                      |--74.13%--__do_sys_copy_file_range
          |                      |          |
          |                      |           --73.72%--vfs_copy_file_range
          |                      |                      |
          |                      |                       --73.14%--xfs_file_remap_range
          |                      |                                  |
          |                      |                                  |--70.65%--xfs_reflink_remap_blocks
          |                      |                                  |          |
          |                      |                                  |           --69.86%--xfs_reflink_remap_extent

So I took a stab at this in 0007, which detects runs of blocks coming
from the same source file (limited to 128 blocks, i.e. 1MB). I only did
this for the copy_file_range() calls in 0007, and the results for XFS
look like this (complete results are in the PDF):

          old (block-by-block)      new (batches)
   ------------------------------------------------------
    1%           150s                     4s
   10%        150-200s                   46s
   20%        150-200s                   65s

Yes, once again, those results are real, the backups are valid etc. So
not only does it take much less space (thanks to block alignment), it
also takes much less time (thanks to bulk operations).

The cases with "manifest=on" improve too, but not nearly this much. I
believe this is simply because the read/write still happens block by
block. But it shouldn't be difficult to do that in a bulk manner too (we
already have the range detected, but I was lazy).

[1]
https://www.postgresql.org/message-id/0e27835d-dab5-49cd-a3ea-52cf6d9ef59e%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
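The run detection can be sketched roughly like this: scan a per-block
source map and extend a run while consecutive blocks come from the same
file at adjacent offsets, capped at 128 blocks. All names here are
invented for illustration (the actual 0007 patch works on its rfile
structures, not a block_src array):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ  8192
#define MAX_RUN 128				/* cap batches at 1MB, as in the WIP patch */

/* One entry per output block: which source file, at what offset. */
typedef struct
{
	int			source;			/* index of the backup the block comes from */
	uint64_t	offset;			/* byte offset within that source file */
} block_src;

/*
 * Return the length of the contiguous run starting at block i, i.e. how
 * many consecutive blocks come from the same source at adjacent offsets.
 * The caller can then issue one copy (or one read/write) of
 * run_length(...) * BLCKSZ bytes instead of per-block calls.
 */
static size_t
run_length(const block_src *map, size_t nblocks, size_t i)
{
	size_t		len = 1;

	while (len < MAX_RUN && i + len < nblocks &&
		   map[i + len].source == map[i].source &&
		   map[i + len].offset == map[i].offset + len * BLCKSZ)
		len++;
	return len;
}
```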
Attachments:
- v20240401-0001-WIP-block-alignment.patch
- v20240401-0002-WIP-prefetch-blocks-when-reconstructing-fi.patch
- v20240401-0003-use-clone-copy_file_range-to-copy-whole-fi.patch
- v20240401-0004-use-copy_file_range-in-write_reconstructed.patch
- v20240401-0005-use-copy_file_range-with-checksums.patch
- v20240401-0006-allow-cloning-with-checksum-calculation.patch
- v20240401-0007-WIP-copy-larger-chunks-from-the-same-file.patch
- xeon-nvme-xfs.csv
- i5-ssd-zfs.csv
- xeon-nvme-btrfs.csv
- benchmark-results.pdf
- generate-backups.sh
- restore-backups.sh
On Tue, Apr 2, 2024 at 8:43 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> And I think he's right, and my tests confirm this. I did a trivial patch
> to align the blocks to 8K boundary, by forcing the header to be a
> multiple of 8K (I think 4K alignment would be enough). See the 0001
> patch that does this.
>
> And if I measure the disk space used by pg_combinebackup, and compare
> the results with results without the patch ([1] from a couple days
> back), I see this:
>
>    pct      not aligned      aligned
>    -------------------------------------
>     1%         689M            19M
>    10%        3172M            22M
>    20%       13797M            27M
>
> Yes, those numbers are correct. I didn't believe this at first, but the
> backups are valid/verified, checksums are OK, etc. BTRFS has similar
> numbers (e.g. drop from 20GB to 600MB).

Fantastic.

> I think we absolutely need to align the blocks in the incremental files,
> and I think we should do that now. I think 8K would work, but maybe we
> should add alignment parameter to basebackup & manifest?
>
> The reason why I think maybe this should be a basebackup parameter is
> the recent discussion about large fs blocks - it seems to be in the
> works, so maybe better to be ready and not assume all fs have 4K.
>
> And I think we probably want to do this now, because this affects all
> tools dealing with incremental backups - even if someone writes a custom
> version of pg_combinebackup, it will have to deal with misaligned data.
> Perhaps there might be something like pg_basebackup that "transforms"
> the data received from the server (and also the backup manifest), but
> that does not seem like a great direction.

+1, and I think BLCKSZ is the right choice.

> I was very puzzled by the awful performance on ZFS. When every other fs
> (EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took
> 900-1000 seconds on ZFS, no matter what I did. I tried all the tuning
> advice I could think of, with almost no effect.
> > Ultimately I decided that it probably is the "no readahead" behavior > I've observed on ZFS. I assume it's because it doesn't use the page > cache where the regular readahead is detected etc. And there's no > prefetching in pg_combinebackup, so I decided to an experiment and added > a trivial explicit prefetch when reconstructing the file - every time > we'd read data from a file, we do posix_fadvise for up to 128 blocks > ahead (similar to what bitmap heap scan code does). See 0002. > > And tadaaa - the duration dropped from 900-1000 seconds to only about > 250-300 seconds, so an improvement of a factor of 3-4x. I think this is > pretty massive. Interesting. ZFS certainly has its own prefetching heuristics with lots of logic and settings, but it could be that it's using strict-next-block detection of access pattern (ie what I called effective_io_readahead_window=0 in the streaming I/O thread) instead of a window (ie like the Linux block device level read ahead where, AFAIK, if you access anything in that sliding window it is triggered), and perhaps your test has a lot of non-contiguous but close-enough blocks? (Also reminds me of the similar discussion on the BHS thread about distinguishing sequential access from mostly-sequential-but-with-lots-of-holes-like-Swiss-cheese, and the fine line between them.) You could double-check this and related settings (for example I think it might disable itself automatically if you're on a VM with small RAM size): https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-prefetch-disable > There's a couple more interesting ZFS details - the prefetching seems to > be necessary even when using copy_file_range() and don't need to read > the data (to calculate checksums). This is why the "manifest=off" chart > has the strange group of high bars at the end - the copy cases are fast > because prefetch happens, but if we switch to copy_file_range() there > are no prefetches and it gets slow. 
Hmm, at a guess, it might be due to prefetching the dnode (root object for a file) and block pointers, ie the structure but not the data itself. > This is a bit bizarre, especially because the manifest=on cases are > still fast, exactly because the pread + prefetching still happens. I'm > sure users would find this puzzling. > > Unfortunately, the prefetching is not beneficial for all filesystems. > For XFS it does not seem to make any difference, but on BTRFS it seems > to cause a regression. > > I think this means we may need a "--prefetch" option, that'd force > prefetching, probably both before pread and copy_file_range. Otherwise > people on ZFS are doomed and will have poor performance. Seems reasonable if you can't fix it by tuning ZFS. (Might also be an interesting research topic for a potential ZFS patch: prefetch_swiss_cheese_window_size. I will not be nerd-sniped into reading the relevant source today, but I'll figure it out soonish...) > So I took a stab at this in 0007, which detects runs of blocks coming > from the same source file (limited to 128 blocks, i.e. 1MB). I only did > this for the copy_file_range() calls in 0007, and the results for XFS > look like this (complete results are in the PDF): > > old (block-by-block) new (batches) > ------------------------------------------------------ > 1% 150s 4s > 10% 150-200s 46s > 20% 150-200s 65s > > Yes, once again, those results are real, the backups are valid etc. So > not only it takes much less space (thanks to block alignment), it also > takes much less time (thanks to bulk operations). Again, fantastic.
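For readers following along, the alignment fix discussed above boils down to padding the incremental file's header so the first block lands on a BLCKSZ boundary. A minimal sketch of the rounding arithmetic (the helper name is mine, not the patch's; BLCKSZ assumed to be the default 8192):

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192             /* default PostgreSQL block size */

/* Round the incremental-file header length up to the next BLCKSZ
 * boundary, so that every block of data in the file is block-aligned
 * and a CoW filesystem can clone/share whole blocks.
 * (Hypothetical helper, for illustration only.) */
static inline uint64_t
align_to_blcksz(uint64_t len)
{
    return (len + BLCKSZ - 1) & ~((uint64_t) (BLCKSZ - 1));
}
```

Since BLCKSZ is a multiple of 4K, aligning to BLCKSZ also gives the 4K alignment that would already be enough for most filesystems.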
On 4/1/24 23:45, Thomas Munro wrote: > ... >> >> I was very puzzled by the awful performance on ZFS. When every other fs >> (EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took >> 900-1000 seconds on ZFS, no matter what I did. I tried all the tuning >> advice I could think of, with almost no effect. >> >> Ultimately I decided that it probably is the "no readahead" behavior >> I've observed on ZFS. I assume it's because it doesn't use the page >> cache where the regular readahead is detected etc. And there's no >> prefetching in pg_combinebackup, so I decided to an experiment and added >> a trivial explicit prefetch when reconstructing the file - every time >> we'd read data from a file, we do posix_fadvise for up to 128 blocks >> ahead (similar to what bitmap heap scan code does). See 0002. >> >> And tadaaa - the duration dropped from 900-1000 seconds to only about >> 250-300 seconds, so an improvement of a factor of 3-4x. I think this is >> pretty massive. > > Interesting. ZFS certainly has its own prefetching heuristics with > lots of logic and settings, but it could be that it's using > strict-next-block detection of access pattern (ie what I called > effective_io_readahead_window=0 in the streaming I/O thread) instead > of a window (ie like the Linux block device level read ahead where, > AFAIK, if you access anything in that sliding window it is triggered), > and perhaps your test has a lot of non-contiguous but close-enough > blocks? (Also reminds me of the similar discussion on the BHS thread > about distinguishing sequential access from > mostly-sequential-but-with-lots-of-holes-like-Swiss-cheese, and the > fine line between them.) > I don't think the files have a lot of non-contiguous but close-enough blocks (it's rather that we'd skip blocks that need to come from a later incremental file). The backups are generated to have a certain fraction of modified blocks. 
For example, the smallest backup has 1% modified blocks, meaning 99% of blocks come from the base backup and 1% from the increment. And indeed, the whole database is ~75GB and the backup is ~740MB. Which means that on average there will be runs of 99 blocks from the base backup, then a skip of 1 block (coming from the increment), and then again 99-1-99-1. So it's very sequential, almost no holes, and the increment is 100% sequential. And it still does not seem to prefetch anything. > You could double-check this and related settings (for example I think > it might disable itself automatically if you're on a VM with small RAM > size): > > https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-prefetch-disable > I haven't touched that parameter at all, and it's "enabled" by default: # cat /sys/module/zfs/parameters/zfs_prefetch_disable 0 While trying to make the built-in prefetch work I reviewed the other parameters with the "prefetch" tag, without success. And I haven't seen any advice on how to make it work ... >> There's a couple more interesting ZFS details - the prefetching seems to >> be necessary even when using copy_file_range() and don't need to read >> the data (to calculate checksums). This is why the "manifest=off" chart >> has the strange group of high bars at the end - the copy cases are fast >> because prefetch happens, but if we switch to copy_file_range() there >> are no prefetches and it gets slow. > > Hmm, at a guess, it might be due to prefetching the dnode (root object > for a file) and block pointers, ie the structure but not the data > itself. > Yeah, that's possible. But the effects are the same - it doesn't matter what exactly is not prefetched. But perhaps we could prefetch just a tiny part of the record, enough to fetch the dnode+pointers, not the whole record. Might save some space in ARC, perhaps?
>> This is a bit bizarre, especially because the manifest=on cases are >> still fast, exactly because the pread + prefetching still happens. I'm >> sure users would find this puzzling. >> >> Unfortunately, the prefetching is not beneficial for all filesystems. >> For XFS it does not seem to make any difference, but on BTRFS it seems >> to cause a regression. >> >> I think this means we may need a "--prefetch" option, that'd force >> prefetching, probably both before pread and copy_file_range. Otherwise >> people on ZFS are doomed and will have poor performance. > > Seems reasonable if you can't fix it by tuning ZFS. (Might also be an > interesting research topic for a potential ZFS patch: > prefetch_swiss_cheese_window_size. I will not be nerd-sniped into > reading the relevant source today, but I'll figure it out soonish...) > It's entirely possible I'm just too stupid and it works just fine for everyone else. But maybe not, and I'd say an implementation that is this difficult to configure is almost as if it didn't exist at all. The linux read-ahead works by default pretty great. So I don't see how to make this work without explicit prefetch ... Of course, we could also do no prefetch and tell users it's up to ZFS to make this work, but I don't think it does them any service. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Hi, > > I've been running some benchmarks and experimenting with various stuff, > trying to improve the poor performance on ZFS, and the regression on XFS > when using copy_file_range. And oh boy, did I find interesting stuff ... [..] Congratulations on great results! > 4) after each pg_combinebackup run to pg_verifybackup, start the cluster > to finish recovery, run pg_checksums --check (to check the patches don't > produce something broken) I've performed some follow-up small testing on all patches mentioned here (1..7), with the earlier developed nano-incremental-backup-tests that helped detect some issues for Robert earlier during original development. They all went fine in both cases: - no special options when using pg_combinebackup - using pg_combinebackup --copy-file-range --manifest-checksums=NONE Those were: test_across_wallevelminimal.sh test_full_pri__incr_stby__restore_on_pri.sh test_full_pri__incr_stby__restore_on_stby.sh test_full_stby__incr_stby__restore_on_pri.sh test_full_stby__incr_stby__restore_on_stby.sh test_incr_after_timelineincrease.sh test_incr_on_standby_after_promote.sh test_many_incrementals_dbcreate_duplicateOID.sh test_many_incrementals_dbcreate_filecopy_NOINCR.sh test_many_incrementals_dbcreate_filecopy.sh test_many_incrementals_dbcreate.sh test_many_incrementals.sh test_multixact.sh test_pending_2pc.sh test_reindex_and_vacuum_full.sh test_repro_assert_RP.sh test_repro_assert.sh test_standby_incr_just_backup.sh test_stuck_walsum.sh test_truncaterollback.sh test_unlogged_table.sh > Now to the findings .... > > > 1) block alignment [..] > And I think we probably want to do this now, because this affects all > tools dealing with incremental backups - even if someone writes a custom > version of pg_combinebackup, it will have to deal with misaligned data. 
> Perhaps there might be something like pg_basebackup that "transforms" > the data received from the server (and also the backup manifest), but > that does not seem like a great direction. If anything is on the table, then I think in the far future pg_refresh_standby_using_incremental_backup_from_primary would be the only other tool using the format? > 2) prefetch > ----------- [..] > I think this means we may need a "--prefetch" option, that'd force > prefetching, probably both before pread and copy_file_range. Otherwise > people on ZFS are doomed and will have poor performance. Right, we could optionally cover in the docs later on various options to get the performance (on XFS use $this, but without $that and so on). It's kind of madness dealing with all those performance variations. Another idea: remove that 128 posix_fadvise() hardcode in 0002 and add a getopt variant like: --prefetch[=HOWMANY], with 128 as the default? -J.
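Jakub's --prefetch[=HOWMANY] idea maps naturally onto getopt_long's optional_argument, where optarg is NULL when the option is given without a value. A hypothetical sketch of the argument handling (not the actual pg_combinebackup code):

```c
#include <assert.h>
#include <stdlib.h>

#define PREFETCH_DEFAULT 128    /* blocks, matching the 0002 hardcode */

/* With getopt_long and an option declared as
 *   {"prefetch", optional_argument, NULL, ...}
 * optarg is NULL when the user writes plain --prefetch, and points at
 * the value for --prefetch=64. Fall back to the default in the former
 * case. (Sketch only; error checking on the number omitted.) */
static int
parse_prefetch_arg(const char *arg)
{
    if (arg == NULL)
        return PREFETCH_DEFAULT;
    return atoi(arg);
}
```

Note that optional_argument only accepts the `--prefetch=64` spelling, not `--prefetch 64`, which is a known getopt_long quirk worth documenting if this interface is adopted.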
Hi, Here's a much more polished and cleaned-up version of the patches, fixing all the issues I've been aware of, and with various parts merged into much more cohesive parts (instead of keeping them separate to make the changes/evolution more obvious). I decided to reorder the changes like this: 1) block alignment - As I said earlier, I think we should definitely do this, even if only to make future improvements possible. After chatting about this with Robert off-list a bit, he pointed out I actually forgot to not align the headers for files without any blocks, so this version fixes that. 2) add clone/copy_file_range for the case that copies whole files. This is pretty simple, but it also adds the --clone/copy-file-range options, and similar infrastructure. The one slightly annoying bit is that we now have the ifdef stuff in two places - when parsing the options, and then in the copy_file_XXX methods, and the pg_fatal() calls should not be reachable in practice. But that doesn't seem harmful, and might be a useful protection against someone calling a function that does nothing. This also merges the original two parts, where the first one only did this cloning/CoW stuff when a checksum did not need to be calculated, and the second extended it to those cases too (by also reading the data, but still doing the copy the old way). I think this is the right way - if that's not desirable, it's easy to either add --no-manifest or not use the CoW options. Or just not change the checksum type. There's no other way. 3) add copy_file_range to write_reconstructed_file, using roughly the same logic/reasoning as (2). If --copy-file-range is specified and a checksum should be calculated, the data is read for the checksum, but the copy is done using copy_file_range.
I did rework the write_reconstructed_file() flow a bit, because tracking what exactly needs to be read/written in each of the cases (which copy method, zeroed block, checksum calculated) made the flow really difficult to follow. Instead I introduced functions to read/write a block, and call them from different places. I think this is much better, and it also makes the following part dealing with batches of blocks a much easier / smaller change. 4) prefetching - This is mostly unrelated to the CoW stuff, but it has tremendous benefits, especially for ZFS. I haven't been able to tune ZFS to get decent performance, and ISTM it'd be rather unusable for backup purposes without this. 5) batching in write_reconstructed_file - This benefits especially the copy_file_range case, where I've seen it yield massive speedups (see the message from Monday for better data). 6) batching for prefetch - Similar logic to (5), but for fadvise. This is the only part where I'm not entirely sure whether it actually helps or not. Needs some more analysis, but I'm including it for completeness. I do think the parts (1)-(4) are in pretty good shape, good enough to make them committable in a day or two. I see it mostly as a matter of testing and comment improvements rather than code changes. (5) is in pretty good shape too, but I'd like to maybe simplify and refine the write_reconstructed_file changes a bit more. I don't think it's broken, but it feels a bit too cumbersome. Not sure about (6) yet. I changed how I think about this a bit - I don't really see the CoW copy methods as necessarily faster than the regular copy (even though it can be, with (5)). I think the main benefits are the space savings, enabled by patches (1)-(3). If (4) and (5) get in, that's a bonus, but even without that I don't think the performance is an issue - everything has a cost.
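The batching in (5) boils down to scanning the reconstruction map for runs of blocks that come from the same source file and issuing one copy_file_range() per run instead of one per block, capped at 128 blocks (1MB). A simplified sketch of the run detection - a real implementation would also have to check that the source offsets are consecutive, and the names here are mine, not the patch's:

```c
#include <assert.h>

#define MAX_BATCH_BLOCKS 128    /* 1MB with 8kB blocks */

/* Given source_map[i] = index of the source file that block i comes
 * from, return the number of blocks starting at 'start' that could be
 * copied in a single copy_file_range() call: same source file,
 * contiguous in the output, capped at 1MB. (Simplification: assumes
 * blocks from the same source are also consecutive within it.) */
static int
batch_length(const int *source_map, int nblocks, int start)
{
    int len = 1;

    while (start + len < nblocks &&
           len < MAX_BATCH_BLOCKS &&
           source_map[start + len] == source_map[start])
        len++;
    return len;
}
```

With a 1% increment the typical pattern is 99 base-backup blocks, 1 increment block, so most calls would cover runs of 99 blocks instead of 99 separate 8kB copies.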
On 4/3/24 15:39, Jakub Wartak wrote: > On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> Hi, >> >> I've been running some benchmarks and experimenting with various stuff, >> trying to improve the poor performance on ZFS, and the regression on XFS >> when using copy_file_range. And oh boy, did I find interesting stuff ... > > [..] > > Congratulations on great results! > >> 4) after each pg_combinebackup run to pg_verifybackup, start the cluster >> to finish recovery, run pg_checksums --check (to check the patches don't >> produce something broken) > > I've performed some follow-up small testing on all patches mentioned > here (1..7), with the earlier developed nano-incremental-backup-tests > that helped detect some issues for Robert earlier during original > development. They all went fine in both cases: > - no special options when using pg_combinebackup > - using pg_combinebackup --copy-file-range --manifest-checksums=NONE > > Those were: > test_across_wallevelminimal.sh > test_full_pri__incr_stby__restore_on_pri.sh > test_full_pri__incr_stby__restore_on_stby.sh > test_full_stby__incr_stby__restore_on_pri.sh > test_full_stby__incr_stby__restore_on_stby.sh > test_incr_after_timelineincrease.sh > test_incr_on_standby_after_promote.sh > test_many_incrementals_dbcreate_duplicateOID.sh > test_many_incrementals_dbcreate_filecopy_NOINCR.sh > test_many_incrementals_dbcreate_filecopy.sh > test_many_incrementals_dbcreate.sh > test_many_incrementals.sh > test_multixact.sh > test_pending_2pc.sh > test_reindex_and_vacuum_full.sh > test_repro_assert_RP.sh > test_repro_assert.sh > test_standby_incr_just_backup.sh > test_stuck_walsum.sh > test_truncaterollback.sh > test_unlogged_table.sh > >> Now to the findings .... >> Thanks. Would be great if you could run this on the attached version of the patches, ideally for each of them independently, so make sure it doesn't get broken+fixed somewhere on the way. >> >> 1) block alignment > > [..] 
> >> And I think we probably want to do this now, because this affects all >> tools dealing with incremental backups - even if someone writes a custom >> version of pg_combinebackup, it will have to deal with misaligned data. >> Perhaps there might be something like pg_basebackup that "transforms" >> the data received from the server (and also the backup manifest), but >> that does not seem like a great direction. > > If anything is on the table, then I think in the far future > pg_refresh_standby_using_incremental_backup_from_primary would be the > only other tool using the format ? > Possibly, but I was thinking more about backup solutions using the same format, but doing the client-side differently. Essentially, something that would still use the server side to generate incremental backups, but replace pg_combinebackup to do this differently (stream the data somewhere else, index it somehow, or whatever). >> 2) prefetch >> ----------- > [..] >> I think this means we may need a "--prefetch" option, that'd force >> prefetching, probably both before pread and copy_file_range. Otherwise >> people on ZFS are doomed and will have poor performance. > > Right, we could optionally cover in the docs later-on various options > to get the performance (on XFS use $this, but without $that and so > on). It's kind of madness dealing with all those performance > variations. > Yeah, something like that. I'm not sure we want to talk too much about individual filesystems in our docs, because those things evolve over time too. And also this depends on how large the increment is. If it only modifies 1% of the blocks, then 99% will come from the full backup, and the sequential prefetch should do OK (well, maybe not on ZFS). But as the incremental backup gets larger / is more random, the prefetch is more and more important. > Another idea: remove that 128 posifx_fadvise() hardcode in 0002 and a > getopt variant like: --prefetch[=HOWMANY] with 128 being as default ? 
> I did think about that, but there's a dependency on the batching. If we're prefetching ~1MB of data, we may need to prefetch up to ~1MB ahead, because otherwise we might read 1MB of which only a tiny part had been prefetched. I was thinking maybe we could skip the sequential parts, but ZFS does need that. So I don't think we can just allow users to set arbitrary values, at least not without also tweaking the batch. Or maybe 1MB batches are too large, and we should use something smaller? I need to think about this a bit more ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
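One way to express the dependency Tomas describes would be to clamp any user-supplied prefetch distance to at least the batch size, so that a full 1MB copy batch is always covered by the prefetch. This is purely a hypothetical sketch, not a settled design:

```c
#include <assert.h>

#define BATCH_BLOCKS 128        /* 1MB batches with 8kB blocks */

/* Never prefetch less than one full batch ahead; otherwise a batched
 * copy_file_range()/pread of BATCH_BLOCKS blocks would touch mostly
 * non-prefetched data. Larger user-requested distances are honored. */
static int
effective_prefetch_distance(int requested)
{
    return requested < BATCH_BLOCKS ? BATCH_BLOCKS : requested;
}
```

An alternative, as suggested downthread, is simply defining the two constants in terms of each other in a shared header so they cannot drift apart.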
Attachments
- v20240403-0001-Properly-align-blocks-in-incremental-backu.patch
- v20240403-0002-Allow-copying-files-using-clone-copy_file_.patch
- v20240403-0003-Use-copy_file_range-in-write_reconstructed.patch
- v20240403-0004-Prefetch-blocks-read-by-write_reconstructe.patch
- v20240403-0005-Try-copying-larger-chunks-of-data-from-the.patch
- v20240403-0006-Try-prefetching-larger-chunks-of-data-from.patch
On Thu, Apr 4, 2024 at 12:56 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Hi, > > Here's a much more polished and cleaned up version of the patches, > fixing all the issues I've been aware of, and with various parts merged > into much more cohesive parts (instead of keeping them separate to make > the changes/evolution more obvious). OK, so three runs of incrementalbackupstests - as stated earlier - also passed with OK for v20240403 (this time even with --enable-casserts) pg_combinebackup flags tested were: 1) --copy-file-range --manifest-checksums=CRC32C 2) --copy-file-range --manifest-checksums=NONE 3) default, no flags (no copy-file-range) > I changed how I think about this a bit - I don't really see the CoW copy > methods as necessary faster than the regular copy (even though it can be > with (5)). I think the main benefits are the space savings, enabled by > patches (1)-(3). If (4) and (5) get it, that's a bonus, but even without > that I don't think the performance is an issue - everything has a cost. I take it differently: incremental backups without a CoW fs would clearly be: - inefficient in terms of RTO (SANs are often a key thing due to the fast ability to "restore" the clone rather than copying the data from somewhere else) - pg_basebackup without that would be unusable without space savings (e.g. imagine daily backups @ 10+TB DWHs) > On 4/3/24 15:39, Jakub Wartak wrote: > > On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: [..] > Thanks. Would be great if you could run this on the attached version of > the patches, ideally for each of them independently, so make sure it > doesn't get broken+fixed somewhere on the way. Those are semi-manual test runs (~30 min? per run), and the above results are for all of them applied at once. So my take is that all of them work together, and each one does individually too.
FWIW, I'm also testing your other offlist incremental backup corruption issue, but that doesn't seem to be related in any way to the copy_file_range() patches here. > >> 2) prefetch > >> ----------- > > [..] > >> I think this means we may need a "--prefetch" option, that'd force > >> prefetching, probably both before pread and copy_file_range. Otherwise > >> people on ZFS are doomed and will have poor performance. > > > > Right, we could optionally cover in the docs later-on various options > > to get the performance (on XFS use $this, but without $that and so > > on). It's kind of madness dealing with all those performance > > variations. > > > > Yeah, something like that. I'm not sure we want to talk too much about > individual filesystems in our docs, because those things evolve over > time too. Sounds like Wiki then. BTW, after a quick review: could we in 0005 have something like a common value (to keep those together via some .h)? #define BATCH_SIZE PREFETCH_TARGET ? -J.
On 4/4/24 12:25, Jakub Wartak wrote: > On Thu, Apr 4, 2024 at 12:56 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> Hi, >> >> Here's a much more polished and cleaned up version of the patches, >> fixing all the issues I've been aware of, and with various parts merged >> into much more cohesive parts (instead of keeping them separate to make >> the changes/evolution more obvious). > > OK, so three runs of incrementalbackupstests - as stated earlier - > also passed with OK for v20240403 (his time even with > --enable-casserts) > > pg_combinebackup flags tested were: > 1) --copy-file-range --manifest-checksums=CRC32C > 2) --copy-file-range --manifest-checksums=NONE > 3) default, no flags (no copy-file-range) > Thanks! >> I changed how I think about this a bit - I don't really see the CoW copy >> methods as necessary faster than the regular copy (even though it can be >> with (5)). I think the main benefits are the space savings, enabled by >> patches (1)-(3). If (4) and (5) get it, that's a bonus, but even without >> that I don't think the performance is an issue - everything has a cost. > > I take i differently: incremental backups without CoW fs would be clearly : > - inefficient in terms of RTO (SANs are often a key thing due to > having fast ability to "restore" the clone rather than copying the > data from somewhere else) > - pg_basebackup without that would be unusuable without space savings > (e.g. imagine daily backups @ 10+TB DWHs) > Right, although this very much depends on the backup scheme. If you only take incremental backups, and then also a full backup once in a while, the CoW stuff probably does not help much. The alignment (the only thing affecting basebackups) may allow deduplication, but that's all I think. If the scheme is more complex, and involves "merging" the increments into the full backup, then this does help a lot. It'd even be possible to cheaply clone instances this way, I think. 
But I'm not sure how often would people do that on the same volume, to benefit from the CoW. >> On 4/3/24 15:39, Jakub Wartak wrote: >>> On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: > [..] >> Thanks. Would be great if you could run this on the attached version of >> the patches, ideally for each of them independently, so make sure it >> doesn't get broken+fixed somewhere on the way. > > Those are semi-manual test runs (~30 min? per run), the above results > are for all of them applied at once. So my take is all of them work > each one does individually too. > Cool, thanks. > FWIW, I'm also testing your other offlist incremental backup > corruption issue, but that doesnt seem to be related in any way to > copy_file_range() patches here. > Yes, that's entirely independent, happens with master too. >>>> 2) prefetch >>>> ----------- >>> [..] >>>> I think this means we may need a "--prefetch" option, that'd force >>>> prefetching, probably both before pread and copy_file_range. Otherwise >>>> people on ZFS are doomed and will have poor performance. >>> >>> Right, we could optionally cover in the docs later-on various options >>> to get the performance (on XFS use $this, but without $that and so >>> on). It's kind of madness dealing with all those performance >>> variations. >>> >> >> Yeah, something like that. I'm not sure we want to talk too much about >> individual filesystems in our docs, because those things evolve over >> time too. > > Sounds like Wiki then. > > BTW, after a quick review: could we in 05 have something like common > value then (to keep those together via some .h?) > > #define BATCH_SIZE PREFETCH_TARGET ? > Yes, that's one of the things I'd like to refine a bit more. Making it more consistent / clearer that these things are interdependent. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, I have pushed the three patches of this series - the one that aligns blocks, and the two adding clone/copy_file_range to pg_combinebackup. The committed versions are pretty much the 2024/04/03 version, with various minor cleanups (e.g. I noticed the docs are still claiming the copy methods work only without checksum calculations, but that's no longer true). I also changed the parameter order to keep the dry_run and debug parameters last, it seems nicer this way. The buildfarm reported two compile-time problems, both of them entirely avoidable (reported by cfbot but I failed to notice that). Should have known better ... Anyway, with these patches committed, pg_combinebackup can use CoW stuff to combine backups cheaply (especially in disk-space terms). The first patch (block alignment) however turned out to be important even for non-CoW filesystems, in some cases. I did a lot of benchmarks with the standard block-by-block copying of data, and on a machine with SSD RAID storage the duration went from ~400 seconds for some runs to only about 150 seconds (with aligned blocks). My explanation is that with the misaligned blocks the RAID often has to access two devices to read a block, and the alignment makes that go away. In the attached PDF with results (duration.pdf), showing the duration of pg_combinebackup on an increment of a particular size (1%, 10% or 20%), this is visible as a green square on the right. Those columns are results relative to a baseline - which for "copy" is master before the block alignment patch, and for "copy_file_range" it's the 3-reconstruct (adding copy_file_range to combining blocks from increments). FWIW the last three columns are a comparison with prefetching enabled. There's a couple interesting observations from this, based on which I'm not going to try to get the remaining patches (batching and prefetching) into v17. It clearly needs more analysis to make the right tradeoff. 
From the table, I think it's clear that: 0) Block alignment has a big impact on RAID storage, even with regular copy. 1) The batching (original patch 0005) either does not help the regular copy, or it actually makes it slower. The PDF is a bit misleading because it seems to suggest the i5 machine is unaffected, while the xeon gets ~30% slower. But that's just an illusion - the comparison is to master, but the alignment patch made i5 about 2x faster. So it's 200% slower when compared to "current master" with the alignment patch. That's not great :-/ And also a bit strange - I would have expected the batching to help the simple copy too. I haven't looked into why this happens, so there's a chance I made some silly mistake, who knows. For the copy_file_range case the batching is usually very beneficial, sometimes reducing the duration to a fraction of the non-batched case. My interpretation is that (unless there's a bug in the patch) we may need two variants of that code - a non-batched one for regular copy, and a batched variant for copy_file_range. 2) The prefetching is not a huge improvement, at least not for these three filesystems (btrfs, ext4, xfs). From the color scale it might seem like it helps, but those values are relative to the baseline, so when the non-prefetching value is 5% and with prefetching 10%, that means the prefetching makes it slower. And that's very often true. This is visible more clearly in prefetching.pdf, comparing the non-prefetching and prefetching results for each patch, not to baseline. That makes it quite clear there's a lot of "red" where prefetching makes it slower. It certainly does help for larger increments (which makes sense, because the modified blocks are distributed randomly, and thus come from random files, making long streaks unlikely). I've imagined the prefetching could be made a bit smarter to ignore the streaks (=sequential patterns), but once again - this only matters with the batching, which we don't have.
And without the batching it looks like a net loss (that's the first column in the prefetching PDF). I did start thinking about prefetching because of ZFS, where it was necessary to get decent performance. And that's still true. But (a) even with the explicit prefetching it's still 2-3x slower than any of these filesystems, so I assume performance-sensitive use cases won't use it. And (b) the prefetching seems necessary in all cases, no matter how large the increment is. Which goes directly against the idea of looking at how random the blocks are and prefetching only the sufficiently random patterns. That doesn't seem like a great thing. 3) There's also the question of disk space usage. The size.pdf shows how the patches affect space needed for the pg_combinebackup result. It does depend a bit on the internal fs cleanup for each run, but it seems the batching makes a difference - clearly copying 1MB blocks instead of 8kB allows lower overhead for some filesystems (e.g. btrfs, where we get from ~1.5GB to a couple MBs). But the space savings are quite negligible compared to just using --copy-file-range option (where we get from 75GB to 1.5GB). I think the batching is interesting mostly because of the substantial duration reduction. I'm also attaching the benchmarking script I used (warning: ugly!), and results for the three filesystems. For ZFS I only have partial results so far, because it's so slow, but in general - without prefetching it's slow (~1000s) with prefetching it's better but still slow (~250s). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On 4/5/24 21:43, Tomas Vondra wrote: > Hi, > > ... > > 2) The prefetching is not a huge improvement, at least not for these > three filesystems (btrfs, ext4, xfs). From the color scale it might seem > like it helps, but those values are relative to the baseline, so when > the non-prefetching value is 5% and with prefetching 10%, that means the > prefetching makes it slower. And that's very often true. > > This is visible more clearly in prefetching.pdf, comparing the > non-prefetching and prefetching results for each patch, not to baseline. > That's makes it quite clear there's a lot of "red" where prefetching > makes it slower. It certainly does help for larger increments (which > makes sense, because the modified blocks are distributed randomly, and > thus come from random files, making long streaks unlikely). > > I've imagined the prefetching could be made a bit smarter to ignore the > streaks (=sequential patterns), but once again - this only matters with > the batching, which we don't have. And without the batching it looks > like a net loss (that's the first column in the prefetching PDF). > > I did start thinking about prefetching because of ZFS, where it was > necessary to get decent performance. And that's still true. But (a) even > with the explicit prefetching it's still 2-3x slower than any of these > filesystems, so I assume performance-sensitive use cases won't use it. > And (b) the prefetching seems necessary in all cases, no matter how > large the increment is. Which goes directly against the idea of looking > at how random the blocks are and prefetching only the sufficiently > random patterns. That doesn't seem like a great thing. > I finally got a more complete ZFS results, and I also decided to get some numbers without the ZFS tuning I did. And boy oh boy ... 
All the tests I did with ZFS were tuned the way I've seen recommended
when using ZFS for PostgreSQL, that is

    zfs set recordsize=8K logbias=throughput compression=none

and this performed quite poorly - pg_combinebackup took 4-8x longer than
with the traditional filesystems (btrfs, xfs, ext4), and the only thing
that improved that measurably was prefetching.

But once I reverted back to the default recordsize of 128kB, the
performance is way better - entirely comparable to ext4/xfs, while
btrfs remains faster with --copy-file-range --no-manifest (by a factor
of 2-3x).

This is quite clearly visible in the attached "current.pdf", which shows
results for the current master (i.e. filtered to the 3-reconstruct patch
adding CoW stuff to write_reconstructed_file).

There are also some differences in disk usage, where ZFS seems to need
more space than xfs/btrfs (as if there was no block sharing), but maybe
that's due to how I measure this using df ...

I also tried a "completely default" ZFS configuration, with all options
left at the default (recordsize=128kB, compression=lz4, and
logbias=latency). That performs about the same, except that the disk
usage is lower thanks to the compression.

Note: Because I'm a hip cool kid, I also ran the tests on bcachefs. The
results are included in the CSV/PDF attachments. In general it's much
slower than xfs/btrfs/ext4, and the disk space is somewhere in between
btrfs and xfs (for the CoW cases). We'll see how this improves as it
matures.

The attachments are tables with the total duration / disk space usage,
and the impact of prefetching. The tables are similar to what I shared
before, except that the color scale is applied to the values directly.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
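For reference, the explicit prefetching being benchmarked here - including the "make it smarter by ignoring sequential streaks" idea from the quoted message - can be sketched in a few lines of Python using `os.posix_fadvise()`. This is an illustrative stand-in under my own assumptions (the real patch is C code inside pg_combinebackup's reconstruction loop, and whether a given filesystem honors the advice - ZFS often does not - is up to the kernel):

```python
import os

BLCKSZ = 8192  # PostgreSQL block size

def prefetch(fd, upcoming_blocks, blcksz=BLCKSZ):
    """Hint the kernel to read ahead blocks we are about to process.
    Blocks that continue a sequential streak are skipped, since ordinary
    kernel readahead already covers those; only blocks that break the
    pattern get an explicit POSIX_FADV_WILLNEED.  Returns the number of
    advisories issued."""
    issued = 0
    prev = None
    for blkno in upcoming_blocks:
        if prev is not None and blkno == prev + 1:
            prev = blkno  # sequential streak: let readahead handle it
            continue
        os.posix_fadvise(fd, blkno * blcksz, blcksz,
                         os.POSIX_FADV_WILLNEED)
        issued += 1
        prev = blkno
    return issued
```

Note that the benchmarks above found this kind of prefetching a net loss on btrfs/ext4/xfs without batching, and unnecessary on ZFS once the default 128kB recordsize is used - which is the basis for the conclusion in the follow-up message.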
On 4/7/24 19:46, Tomas Vondra wrote:
> On 4/5/24 21:43, Tomas Vondra wrote:
>> Hi,
>>
>> ...
>>
>> 2) The prefetching is not a huge improvement, at least not for these
>> three filesystems (btrfs, ext4, xfs). From the color scale it might seem
>> like it helps, but those values are relative to the baseline, so when
>> the non-prefetching value is 5% and with prefetching 10%, that means the
>> prefetching makes it slower. And that's very often true.
>>
>> This is visible more clearly in prefetching.pdf, comparing the
>> non-prefetching and prefetching results for each patch, not to baseline.
>> That makes it quite clear there's a lot of "red" where prefetching
>> makes it slower. It certainly does help for larger increments (which
>> makes sense, because the modified blocks are distributed randomly, and
>> thus come from random files, making long streaks unlikely).
>>
>> I've imagined the prefetching could be made a bit smarter to ignore the
>> streaks (=sequential patterns), but once again - this only matters with
>> the batching, which we don't have. And without the batching it looks
>> like a net loss (that's the first column in the prefetching PDF).
>>
>> I did start thinking about prefetching because of ZFS, where it was
>> necessary to get decent performance. And that's still true. But (a) even
>> with the explicit prefetching it's still 2-3x slower than any of these
>> filesystems, so I assume performance-sensitive use cases won't use it.
>> And (b) the prefetching seems necessary in all cases, no matter how
>> large the increment is. Which goes directly against the idea of looking
>> at how random the blocks are and prefetching only the sufficiently
>> random patterns. That doesn't seem like a great thing.
>>
>
> I finally got more complete ZFS results, and I also decided to get
> some numbers without the ZFS tuning I did. And boy oh boy ...
>
> All the tests I did with ZFS were tuned the way I've seen recommended
> when using ZFS for PostgreSQL, that is
>
>     zfs set recordsize=8K logbias=throughput compression=none
>
> and this performed quite poorly - pg_combinebackup took 4-8x longer than
> with the traditional filesystems (btrfs, xfs, ext4), and the only thing
> that improved that measurably was prefetching.
>
> But once I reverted back to the default recordsize of 128kB, the
> performance is way better - entirely comparable to ext4/xfs, while
> btrfs remains faster with --copy-file-range --no-manifest (by a factor
> of 2-3x).
>

I forgot to say explicitly that I think this confirms the decision not
to push the patch adding the explicit prefetching to pg_combinebackup.
It's not needed/beneficial even for ZFS, when using a suitable
configuration.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company