Thread: pg_upgrade --copy-file-range


pg_upgrade --copy-file-range

From
Thomas Munro
Date:
Hello,

I was just in a pg_upgrade unconference session at PGCon where the
lack of $SUBJECT came up.  This system call gives the kernel the
option to use fast block cloning on XFS, ZFS (as of very recently),
etc, and works on Linux and FreeBSD.  It's probably much the same as
--clone mode on COW file systems, except that mode is Linux-only.  On
overwrite file systems (ie not copy-on-write, like ext4), it may also
be able to push copies down to storage hardware/network file systems.

There was something like this in the nearby large files patch set, but
in that version it just magically did it when available in --copy
mode.  Now I think the user should have to opt in with
--copy-file-range, and it should simply error out if it fails.  It may not
work in some cases -- for example, the man page says that older Linux
systems can fail with EXDEV when you try to copy across file systems,
while newer systems will do something less efficient but still
sensible internally; also I saw a claim that some older versions had
weird bugs.  Better to just expose the raw functionality and let users
say when they want it and read the error if it fails, I think.
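
To make that concrete, here is a minimal standalone sketch (the helper
name is made up; this is not the patch itself) of what exposing the raw
functionality and surfacing the raw error could look like:

#define _GNU_SOURCE
#include <sys/stat.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: copy one whole file with copy_file_range(). */
static void
copy_whole_file(int src_fd, int dst_fd, const char *src, const char *dst)
{
    struct stat st;
    off_t       remaining;

    if (fstat(src_fd, &st) < 0)
    {
        fprintf(stderr, "could not stat \"%s\": %s\n", src, strerror(errno));
        exit(1);
    }

    remaining = st.st_size;
    while (remaining > 0)
    {
        /* NULL offsets: the kernel advances both file positions itself. */
        ssize_t nbytes = copy_file_range(src_fd, NULL, dst_fd, NULL,
                                         (size_t) remaining, 0);

        if (nbytes < 0)
        {
            /* e.g. EXDEV on older kernels when crossing file systems */
            fprintf(stderr, "error copying \"%s\" to \"%s\": %s\n",
                    src, dst, strerror(errno));
            exit(1);
        }
        if (nbytes == 0)
            break;              /* source shorter than expected */
        remaining -= nbytes;
    }
}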

Attachments

Re: pg_upgrade --copy-file-range

From
Peter Eisentraut
Date:
On 02.06.23 21:30, Thomas Munro wrote:
> I was just in a pg_upgrade unconference session at PGCon where the
> lack of $SUBJECT came up.  This system call gives the kernel the
> option to use fast block cloning on XFS, ZFS (as of very recently),
> etc, and works on Linux and FreeBSD.  It's probably much the same as
> --clone mode on COW file systems, except that mode is Linux-only.  On
> overwrite file systems (ie not copy-on-write, like ext4), it may also
> be able to push copies down to storage hardware/network file systems.
> 
> There was something like this in the nearby large files patch set, but
> in that version it just magically did it when available in --copy
> mode.  Now I think the user should have to opt in with
> --copy-file-range, and it should simply error out if it fails.  It may not
> work in some cases -- for example, the man page says that older Linux
> systems can fail with EXDEV when you try to copy across file systems,
> while newer systems will do something less efficient but still
> sensible internally; also I saw a claim that some older versions had
> weird bugs.  Better to just expose the raw functionality and let users
> say when they want it and read the error if it fails, I think.

When we added --clone, copy_file_range() was available, but the problem 
was that it was hard for the user to predict whether you'd get the fast 
clone behavior or the slow copy behavior.  That's the kind of thing you 
want to know when planning and testing your upgrade.  At the time, there 
were patches passed around in Linux kernel circles that would have been 
able to enforce cloning via the flags argument of copy_file_range(), but 
that didn't make it to the mainline.

So, yes, being able to specify exactly which copy mechanism to use makes 
sense, so that users can choose the tradeoffs.

About your patch:

I think you should have a "check" function called from 
check_new_cluster().  That check function can then also handle the "not 
supported" case, and you don't need to handle that in 
parseCommandLine().  I suggest following the clone example for these, 
since the issues there are very similar.
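
For illustration, such a check could look roughly like the sketch below,
loosely modeled on the existing clone check.  The function name, the
scratch-file suffix, and the reliance on pg_upgrade internals (pg_fatal(),
old_cluster/new_cluster, MAXPGPATH) are assumptions here, not the actual
patch:

static void
check_copy_file_range(void)
{
#if defined(HAVE_COPY_FILE_RANGE)
    char        existing_file[MAXPGPATH];
    char        new_link_file[MAXPGPATH];
    int         src_fd;
    int         dst_fd;

    snprintf(existing_file, sizeof(existing_file), "%s/PG_VERSION",
             old_cluster.pgdata);
    snprintf(new_link_file, sizeof(new_link_file),
             "%s/PG_VERSION.copyrangetest", new_cluster.pgdata);
    unlink(new_link_file);      /* might not exist; that's fine */

    if ((src_fd = open(existing_file, O_RDONLY, 0)) < 0)
        pg_fatal("could not open file \"%s\": %s",
                 existing_file, strerror(errno));
    if ((dst_fd = open(new_link_file, O_WRONLY | O_CREAT | O_EXCL, 0600)) < 0)
        pg_fatal("could not create file \"%s\": %s",
                 new_link_file, strerror(errno));

    /* try one real copy, so the upgrade fails before any work is done */
    if (copy_file_range(src_fd, NULL, dst_fd, NULL, 16, 0) < 0)
        pg_fatal("could not copy file range: %s", strerror(errno));

    close(src_fd);
    close(dst_fd);
    unlink(new_link_file);
#else
    pg_fatal("copy_file_range is not supported on this platform");
#endif
}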




Re: pg_upgrade --copy-file-range

From
Thomas Munro
Date:
On Mon, Jul 3, 2023 at 7:47 PM Peter Eisentraut <peter@eisentraut.org> wrote:
> When we added --clone, copy_file_range() was available, but the problem
> was that it was hard for the user to predict whether you'd get the fast
> clone behavior or the slow copy behavior.  That's the kind of thing you
> want to know when planning and testing your upgrade.  At the time, there
> were patches passed around in Linux kernel circles that would have been
> able to enforce cloning via the flags argument of copy_file_range(), but
> that didn't make it to the mainline.
>
> So, yes, being able to specify exactly which copy mechanism to use makes
> sense, so that users can choose the tradeoffs.

Thanks for looking.  Yeah, it is quite inconvenient for planning
purposes that it is hard for a user to know which internal strategy it
uses, but that's the interface we have (and clearly "flags" is
reserved for future use so that might still evolve...).

> About your patch:
>
> I think you should have a "check" function called from
> check_new_cluster().  That check function can then also handle the "not
> supported" case, and you don't need to handle that in
> parseCommandLine().  I suggest following the clone example for these,
> since the issues there are very similar.

Done.

Attachments

Re: pg_upgrade --copy-file-range

From
Peter Eisentraut
Date:
On 08.10.23 07:15, Thomas Munro wrote:
>> About your patch:
>>
>> I think you should have a "check" function called from
>> check_new_cluster().  That check function can then also handle the "not
>> supported" case, and you don't need to handle that in
>> parseCommandLine().  I suggest following the clone example for these,
>> since the issues there are very similar.
> 
> Done.

This version looks good to me.

Tiny nit:  You copy-and-pasted "%s/PG_VERSION.clonetest"; maybe choose a 
different suffix.



Re: pg_upgrade --copy-file-range

From
Peter Eisentraut
Date:
On 13.11.23 08:15, Peter Eisentraut wrote:
> On 08.10.23 07:15, Thomas Munro wrote:
>>> About your patch:
>>>
>>> I think you should have a "check" function called from
>>> check_new_cluster().  That check function can then also handle the "not
>>> supported" case, and you don't need to handle that in
>>> parseCommandLine().  I suggest following the clone example for these,
>>> since the issues there are very similar.
>>
>> Done.
> 
> This version looks good to me.
> 
> Tiny nit:  You copy-and-pasted "%s/PG_VERSION.clonetest"; maybe choose a 
> different suffix.

Thomas, are you planning to proceed with this patch?




Re: pg_upgrade --copy-file-range

From
Thomas Munro
Date:
On Sat, Dec 23, 2023 at 9:40 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 13.11.23 08:15, Peter Eisentraut wrote:
> > On 08.10.23 07:15, Thomas Munro wrote:
> >>> About your patch:
> >>>
> >>> I think you should have a "check" function called from
> >>> check_new_cluster().  That check function can then also handle the "not
> >>> supported" case, and you don't need to handle that in
> >>> parseCommandLine().  I suggest following the clone example for these,
> >>> since the issues there are very similar.
> >>
> >> Done.
> >
> > This version looks good to me.
> >
> > Tiny nit:  You copy-and-pasted "%s/PG_VERSION.clonetest"; maybe choose a
> > different suffix.
>
> Thomas, are you planning to proceed with this patch?

Yes.  Sorry for being slow... got stuck working on an imminent new
version of streaming read.  I will be defrosting my commit bit and
committing this one and a few things shortly.

As it happens I was just thinking about this particular patch because
I suddenly had a strong urge to teach pg_combinebackup to use
copy_file_range.  I wonder if you had the same idea...



Re: pg_upgrade --copy-file-range

From
Michael Paquier
Date:
On Sat, Dec 23, 2023 at 09:52:59AM +1300, Thomas Munro wrote:
> As it happens I was just thinking about this particular patch because
> I suddenly had a strong urge to teach pg_combinebackup to use
> copy_file_range.  I wonder if you had the same idea...

Yeah, +1.  That would make copy_file_blocks() more efficient where the
code is copying 50 blocks in batches because it needs to reassign
checksums to the blocks copied.
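
A rough sketch of that idea (the names loosely follow pg_combinebackup's
copy_file.c, but this is not actual code): when no checksum has to be
computed, the whole batched read()/write() loop could collapse into
copy_file_range(), never moving the data through userspace.

static void
copy_file_blocks_sketch(int src_fd, int dst_fd, off_t size,
                        pg_checksum_context *checksum_ctx)
{
    if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
    {
        off_t   remaining = size;

        /* no checksum needed: let the kernel copy (or clone) the bytes */
        while (remaining > 0)
        {
            ssize_t wb = copy_file_range(src_fd, NULL, dst_fd, NULL,
                                         (size_t) remaining, 0);

            if (wb < 0)
                pg_fatal("could not copy file: %m");
            if (wb == 0)
                break;          /* source shorter than expected */
            remaining -= wb;
        }
        return;
    }

    /*
     * Otherwise keep the existing loop: read a batch of blocks, feed it
     * to pg_checksum_update(), and write it out.
     */
}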
--
Michael

Attachments

Re: pg_upgrade --copy-file-range

From
Jakub Wartak
Date:
Hi Thomas, Michael, Peter and -hackers,

On Sun, Dec 24, 2023 at 3:57 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Sat, Dec 23, 2023 at 09:52:59AM +1300, Thomas Munro wrote:
> > As it happens I was just thinking about this particular patch because
> > I suddenly had a strong urge to teach pg_combinebackup to use
> > copy_file_range.  I wonder if you had the same idea...
>
> Yeah, +1.  That would make copy_file_blocks() more efficient where the
> code is copying 50 blocks in batches because it needs to reassign
> checksums to the blocks copied.

I've tried to achieve what you were discussing. Actually this was my
first thought when using pg_combinebackup with larger (realistic)
backup sizes back in December. Attached is a set of very DIRTY (!)
patches that provide CoW options (--clone/--copy-file-range) to
pg_combinebackup (just like pg_upgrade, to keep it in sync), while also
refactoring some related bits of code to avoid duplication.

With XFS (with reflink=1, which is the default) on Linux with kernel 5.10
and ~210GB backups, I'm getting:

root@jw-test-1:/xfs# du -sm *
210229  full
250     incr.1

Today in master, the old classic read()/write() while-loop without
CoW/reflink optimization:
root@jw-test-1:/xfs# rm -rf outtest; sync; sync ; sync; echo 3 | sudo
tee /proc/sys/vm/drop_caches ; time /usr/pgsql17/bin/pg_combinebackup
--manifest-checksums=NONE -o outtest full incr.1
3

real    49m43.963s
user    0m0.887s
sys     2m52.697s

VS patch with "--clone" :

root@jw-test-1:/xfs# rm -rf outtest; sync; sync ; sync; echo 3 | sudo
tee /proc/sys/vm/drop_caches ; time /usr/pgsql17/bin/pg_combinebackup
--manifest-checksums=NONE --clone -o outtest full incr.1
3

real    0m39.812s
user    0m0.325s
sys     0m2.401s

So it is 49mins down to 40 seconds(!) +/-10s (3 tries) if the FS
supports CoW/reflinks (XFS, BTRFS, upcoming bcachefs?). This suggests
that if one actually wants to use incremental backups (to get minimal
RTO), it would be wise to only use CoW filesystems from the start, so
that RTO is as low as possible.

Random patch notes:
- the main meat is in v3-0002*; I hope I did not screw something up seriously
- in the worst case it is opt-in through a switch, so the user can always
stick to the classic copy
- no docs so far
- pg_copyfile_offload_supported() should actually be fixed if it is a
good path forward
- pgindent actually indents larger areas of code than I would like;
any ideas, or is it ok?
- not tested on Win32/MacOS/FreeBSD
- I've tested pg_upgrade manually and it seems to work and issue
correct syscalls, however some tests are failing(?). I haven't
investigated why yet due to lack of time.

Any help is appreciated.

-J.

Attachments

Re: pg_upgrade --copy-file-range

From
Peter Eisentraut
Date:
On 05.01.24 13:40, Jakub Wartak wrote:
> Random patch notes:
> - the main meat is in v3-0002*; I hope I did not screw something up seriously
> - in the worst case it is opt-in through a switch, so the user can always
> stick to the classic copy
> - no docs so far
> - pg_copyfile_offload_supported() should actually be fixed if it is a
> good path forward
> - pgindent actually indents larger areas of code than I would like;
> any ideas, or is it ok?
> - not tested on Win32/MacOS/FreeBSD
> - I've tested pg_upgrade manually and it seems to work and issue
> correct syscalls, however some tests are failing(?). I haven't
> investigated why yet due to lack of time.

Something is wrong with the pgindent in your patch set.  Maybe you used 
a wrong version.  You should try to fix that, because it is hard to 
process your patch with that amount of unrelated reformatting.

As far as I can tell, the original pg_upgrade patch has been ready to 
commit since October.  Unless Thomas has any qualms that have not been 
made explicit in this thread, I suggest we move ahead with that.

And then Jakub could rebase his patch set on top of that.  It looks like 
if the formatting issues are fixed, the remaining pg_combinebackup 
support isn't that big.




Re: pg_upgrade --copy-file-range

From
Thomas Munro
Date:
On Wed, Mar 6, 2024 at 2:43 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> As far as I can tell, the original pg_upgrade patch has been ready to
> commit since October.  Unless Thomas has any qualms that have not been
> made explicit in this thread, I suggest we move ahead with that.

pg_upgrade --copy-file-range pushed.  The only change I made was to
remove the EINTR retry condition which was subtly wrong and actually
not needed here AFAICS.  (Erm, maybe I did have an unexpressed qualm
about some bug reports unfolding around that time about corruption
linked to copy_file_range that might have spooked me, but those seem to
have been addressed.)

> And then Jakub could rebase his patch set on top of that.  It looks like
> if the formatting issues are fixed, the remaining pg_combinebackup
> support isn't that big.

+1

I'll also go and rebase CREATE DATABASE ... STRATEGY=file_clone[1].

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com



Re: pg_upgrade --copy-file-range

From
Tomas Vondra
Date:
Hi,

I took a quick look at the remaining part adding copy_file_range to
pg_combinebackup. The patch no longer applies, so I had to rebase it.
Most of the issues were trivial, but I had to fix a couple missing
prototypes - I added them to copy_file.h/c, mostly.

0001 is the minimal rebase + those fixes

0002 has a couple review comments in copy_file, and it also undoes a lot
of unnecessary formatting changes (already pointed out by Peter a couple
days ago).

A couple review comments:

1) AFAIK opt_errinfo() returns a pointer to the local "buf" variable.

2) I wonder if we even need opt_errinfo(). I'm not sure it actually
makes anything simpler.

3) I think it'd be nice to make CopyFileMethod more consistent with
transferMode in pg_upgrade.h (I mean, it seems wise to make the naming
more consistent, it's probably not worth unifying this somehow).

4) I wonder how we came up with copying the files by 50 blocks, but I
now realize it's been like this since before this patch. I only noticed
because the patch adds a comment before the buffer_size calculation.

5) I dislike the renaming of copy_file_blocks to pg_copyfile. The new
name is way more generic / less descriptive - it's clear it copies the
file block by block (well, in chunks). pg_copyfile is pretty vague.

6) This leaves behind copy_file_copyfile, which is now unused.

7) The patch reworks how combinebackup deals with alternative copy
implementations - instead of setting strategy_implementation and calling
that, the decisions now happen in pg_copyfile_offload with a lot of
conditions / ifdef / defined ... I find it pretty hard to understand and
reason about. I liked the strategy_implementation approach, as it forces
us to keep each method in a separate function.

Perhaps there's a reason why that doesn't work for copy_file_range? But
in that case this needs much clearer comments.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: pg_upgrade --copy-file-range

From
Jakub Wartak
Date:
Hi Tomas,

> I took a quick look at the remaining part adding copy_file_range to
> pg_combinebackup. The patch no longer applies, so I had to rebase it.
> Most of the issues were trivial, but I had to fix a couple missing
> prototypes - I added them to copy_file.h/c, mostly.
>
> 0001 is the minimal rebase + those fixes
>
> 0002 has a couple review comments in copy_file, and it also undoes a lot
> of unnecessary formatting changes (already pointed out by Peter a couple
> days ago).
>

Thank you very much for this! As discussed privately, I'm not in a
position right now to pursue this further at this late stage (at least
for v17, which would require an aggressive schedule). My plan was
more for v18 after Peter's email, due to other obligations. But if you
have cycles and want to continue, please do so without hesitation -
I'll try to chime in along the way to test and review for sure.

> A couple review comments:
>
> 1) AFAIK opt_errinfo() returns a pointer to the local "buf" variable.
>
> 2) I wonder if we even need opt_errinfo(). I'm not sure it actually
> makes anything simpler.

Yes, as it stands it's broken (somehow I missed the gcc warning); it
should be pg_malloc(). I hardly remember, but I wanted to avoid code
duplication. No strong opinion - maybe that's a different style; I'll
adapt as necessary.
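
For readers following along, a minimal illustration of the bug and the
pg_malloc() fix (opt_errinfo here is a simplified stand-in, not the
patch's real body):

static char *
opt_errinfo_broken(const char *option)
{
    char    buf[128];           /* lives on the stack ... */

    snprintf(buf, sizeof(buf), " (option: %s)", option);
    return buf;                 /* ... and is gone once we return */
}

static char *
opt_errinfo_fixed(const char *option)
{
    char   *buf = pg_malloc(128);   /* heap allocation outlives the call */

    snprintf(buf, 128, " (option: %s)", option);
    return buf;                 /* caller is responsible for freeing it */
}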

> 3) I think it'd be nice to make CopyFileMethod more consistent with
> transferMode in pg_upgrade.h (I mean, it seems wise to make the naming
> more consistent, it's probably not worth unifying this somehow).
>
> 4) I wonder how we came up with copying the files by 50 blocks, but I
> now realize it's been like this since before this patch. I only noticed
> because the patch adds a comment before the buffer_size calculation.

It looks like it was like that before pg_upgrade was even moved into
the core. 400kB is indeed a bit strange value, so we can leave it as it
is, or make COPY_BUF_SIZ 128kB - see [1] (I've double-checked that cp(1)
still uses 128kB today) - or maybe just stick to something like 256 or
512 kB.

> 5) I dislike the renaming of copy_file_blocks to pg_copyfile. The new
> name is way more generic / less descriptive - it's clear it copies the
> file block by block (well, in chunks). pg_copyfile is pretty vague.
>
> 6) This leaves behind copy_file_copyfile, which is now unused.
>
> 7) The patch reworks how combinebackup deals with alternative copy
> implementations - instead of setting strategy_implementation and calling
> that, the decisions now happen in pg_copyfile_offload with a lot of
> conditions / ifdef / defined ... I find it pretty hard to understand and
> reason about. I liked the strategy_implementation approach, as it forces
> us to keep each method in a separate function.

Well, some context (maybe it was my mistake to continue in this
thread rather than starting a new one): my plan was 3-in-1: in the
original proposal (from Jan) to provide CoW as a generic facility for
others to use - in src/common/file_utils.c as per
v3-0002-Confine-various-OS-copy-on-write-and-other-copy-a.patch - to
unify & confine CoW methods and their quirkiness between
pg_combinebackup and pg_upgrade and other potential CoW uses too. That
was before Thomas M. pushed CoW just for pg_upgrade as
d93627bcbe5001750e7611f0e637200e2d81dcff. I had this idea back then to
have pg_copyfile() [normal block copy] and
pg_copyfile_offload_supported(),
pg_copyfile_offload(PG_COPYFILE_IOCTL_FICLONE,
PG_COPYFILE_COPY_FILE_RANGE,
PG_COPYFILE_who_has_idea_what_they_come_up_with_in_future). In your
version of the patch it's local to pg_combinebackup, so it might make
no sense after all. If you look at pg_upgrade and pg_combinebackup,
they both have code duplication with lots of those ifs/#ifdefs (assuming
the user wants to have it as a drop-in [--clone/--copy/--copyfile] and
the platform may or may not have it). I've even considered
--cow=ficlone|copy_file_range to sync both tools from the CLI arguments
point of view, but that would break backwards compatibility, so I did
not do that.

Also, there's a problem with pg_combinebackup's strategy_implementation:
it cannot (I think) decide on its own which CoW method to use
or not. There were some longer discussions that settled on one thing
(for pg_upgrade): it's the user who is in control of HOW the copy gets
done (due to potential issues in OS CoW implementations, e.g. where
NFS is involved on one side). See the pg_upgrade
--clone/--copy/--copy-file-range/--sync-method options. I wanted to
stick to that, so pg_combinebackup also needs to give the same options
to the user.

That was the historical context. Now, you wrote "it's probably
not worth unifying this somehow" a few sentences earlier, so my take is
the following: we can just concentrate on getting
copy_file_range() and ioctl_ficlone into pg_combinebackup at the price
of duplicating some complexity for now (in short, to start with a clean
slate; it doesn't necessarily need to be my patch as the base if we think
it's worthwhile for v17 - or stick to your reworked patch of mine).

Later (v18?) a refactor bigger than this one could unify and move the
copy methods to some more central place (then the tools would stay in
sync and there would be no doubling like you mentioned, e.g. pg_upgrade's
enum transferMode <-> the patch's enum CopyFileMethod).

So for now I'm +1 to renaming all the things as you want -- indeed
pg_copy* might not be a good fit in a localized version.

-J.

[1] - https://eklitzke.org/efficient-file-copying-on-linux



Re: pg_upgrade --copy-file-range

From
Tomas Vondra
Date:
Here's a patch reworked along the lines from a couple days ago.

The primary goals were to add clone/copy_file_range while minimizing
unnecessary disruption, and overall cleanup of the patch. I'm not saying
it's committable, but I think the patch is much easier to understand.

The main change is that this abandons the idea of handling all possible
cases in a single function that looks like a maze of ifdefs, and instead
separates each case into its own function, with the decision happening
much earlier. This is pretty much exactly what pg_upgrade does, BTW.

There's maybe an argument that these functions could be unified and
moved to a library in src/common - I can imagine doing that, but I don't
think it's required. The functions are pretty trivial wrappers, and it's
not like we expect many more callers. And there's probably stuff we'd
need to keep out of that library (e.g. the decision which copy/clone
methods are available / should be used or error reporting). So it
doesn't seem worth it, at least for now.

There's one question, though. As it stands, there's a bit of asymmetry
between handling CopyFile() on WIN32 and the clone/copy_file_range on
other platforms. On WIN32, we simply switch to CopyFile
automatically if we don't need to calculate a checksum. But with the
other methods, we error out if the user requests those and we need to
calculate the checksum.

The asymmetry comes from the fact there's no option to request CopyFile
on WIN32, and we feel confident it's always the right thing to do (safe,
faster). We don't seem to know that for the other methods, so the user
has to explicitly request those. And if the user requests --clone, for
example, it'd be wrong to silently fall back to plain copy.

Still, I wonder if this might cause some undesirable issues during
restores. But I guess that's why we have --dry-run.

This asymmetry also shows a bit in the code - the CopyFile is coded and
called a bit differently from the other methods. FWIW I abandoned the
approach with "strategy" and just use a switch on CopyMode enum, just
like pg_upgrade does.
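
Roughly, the shape is the following sketch - the enum values match the
COPY_MODE_* names visible elsewhere in this thread, while the wrapper
function names are illustrative, not necessarily the patch's:

typedef enum CopyMode
{
    COPY_MODE_CLONE,
    COPY_MODE_COPY,
    COPY_MODE_COPY_FILE_RANGE,
#ifdef WIN32
    COPY_MODE_COPYFILE,
#endif
} CopyMode;

static void
copy_file(const char *src, const char *dst,
          pg_checksum_context *checksum_ctx, CopyMode copy_mode)
{
    /* the decision was made early; each method is a trivial wrapper */
    switch (copy_mode)
    {
        case COPY_MODE_CLONE:
            copy_file_clone(src, dst);
            break;
        case COPY_MODE_COPY:
            copy_file_blocks(src, dst, checksum_ctx);
            break;
        case COPY_MODE_COPY_FILE_RANGE:
            copy_file_by_range(src, dst);
            break;
#ifdef WIN32
        case COPY_MODE_COPYFILE:
            copy_file_copyfile(src, dst);
            break;
#endif
    }
}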

There's a couple more smaller changes:

- Addition of docs for --clone/--copy-file-range (shameless copy from
pg_upgrade docs).

- Removal of opt_errinfo - not only was it buggy, I think the code is
actually cleaner without it.

- Removal of the EINTR retry condition from copy_file_range handling (this
is what Thomas ended up doing for pg_upgrade while committing that part).

Put together, this cuts the patch from ~40kB to ~15kB (most of this is
due to the cleanup of unnecessary whitespace changes, though).

I think to make this committable, this requires some review and testing,
ideally on a range of platforms.

One open question is how to allow testing this. For pg_upgrade we now
have PG_TEST_PG_UPGRADE_MODE, which can be set to e.g. "--clone". I
wonder if we should add PG_TEST_PG_COMBINEBACKUP_MODE ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: pg_upgrade --copy-file-range

From
Robert Haas
Date:
On Fri, Mar 22, 2024 at 10:40 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> There's one question, though. As it stands, there's a bit of asymmetry
> between handling CopyFile() on WIN32 and the clone/copy_file_range on
> other platforms. On WIN32, we simply switch to CopyFile
> automatically if we don't need to calculate a checksum. But with the
> other methods, we error out if the user requests those and we need to
> calculate the checksum.

That seems completely broken. copy_file() needs to have the ability to
calculate a checksum if one is required; when one isn't required, it
can do whatever it likes. So we should always fall back to the
block-by-block method if we need a checksum. Whatever option the user
specified should only be applied when we don't need a checksum.

Consider, for example:

pg_basebackup -D sunday -c fast --manifest-checksums=CRC32C
pg_basebackup -D monday -c fast --manifest-checksums=SHA224
--incremental sunday/backup_manifest
pg_combinebackup sunday monday -o tuesday --manifest-checksums=CRC32C --clone

Any files that are copied in their entirety from Sunday's backup can
be cloned, if we have support for cloning. But any files copied from
Monday's backup will need to be re-checksummed, since the checksum
algorithms don't match. With what you're describing, it sounds like
pg_combinebackup would just fail in this case; I don't think that's
the behavior that the user is going to want.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_upgrade --copy-file-range

From
Tomas Vondra
Date:
On 3/22/24 17:42, Robert Haas wrote:
> On Fri, Mar 22, 2024 at 10:40 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> There's one question, though. As it stands, there's a bit of asymmetry
>> between handling CopyFile() on WIN32 and the clone/copy_file_range on
>> other platforms. On WIN32, we simply switch to CopyFile
>> automatically if we don't need to calculate a checksum. But with the
>> other methods, we error out if the user requests those and we need to
>> calculate the checksum.
> 
> That seems completely broken. copy_file() needs to have the ability to
> calculate a checksum if one is required; when one isn't required, it
> can do whatever it likes. So we should always fall back to the
> block-by-block method if we need a checksum. Whatever option the user
> specified should only be applied when we don't need a checksum.
> 
> Consider, for example:
> 
> pg_basebackup -D sunday -c fast --manifest-checksums=CRC32C
> pg_basebackup -D monday -c fast --manifest-checksums=SHA224
> --incremental sunday/backup_manifest
> pg_combinebackup sunday monday -o tuesday --manifest-checksums=CRC32C --clone
> 
> Any files that are copied in their entirety from Sunday's backup can
> be cloned, if we have support for cloning. But any files copied from
> Monday's backup will need to be re-checksummed, since the checksum
> algorithms don't match. With what you're describing, it sounds like
> pg_combinebackup would just fail in this case; I don't think that's
> the behavior that the user is going to want.
> 

Right, this will happen:

  pg_combinebackup: error: unable to use accelerated copy when manifest
  checksums are to be calculated. Use --no-manifest

Are you saying we should just silently override the copy method and do
the copy block by block? I'm not strongly opposed to that, but it feels
wrong to just ignore that the user explicitly requested cloning, and I'm
not sure why this should be different from any other case where the user
requests an incompatible combination of options and/or options that are
not supported on the current configuration.

Why not just tell the user to use the correct parameters, i.e. either
remove --clone or add --no-manifest?

FWIW I now realize it actually fails a bit earlier than I thought - when
parsing the options, not in copy_file. But then some checks (if a given
copy method is supported) happen in the copy functions. I wonder if it'd
be better/possible to do all of that in one place, not sure.

Also, the message only suggests using --no-manifest. It probably should
suggest removing --clone too.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_upgrade --copy-file-range

From
Robert Haas
Date:
On Fri, Mar 22, 2024 at 1:22 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> Right, this will happen:
>
>   pg_combinebackup: error: unable to use accelerated copy when manifest
>   checksums are to be calculated. Use --no-manifest
>
> Are you saying we should just silently override the copy method and do
> the copy block by block?

Yes.

> I'm not strongly opposed to that, but it feels
> wrong to just ignore that the user explicitly requested cloning, and I'm
> not sure why this should be different from any other case where the user
> requests an incompatible combination of options and/or options that are
> not supported on the current configuration.

I don't feel like copying block-by-block when that's needed to compute
a checksum is ignoring what the user requested. I mean, if we'd had to
perform reconstruction rather than copying an entire file, we would
have done that regardless of whether --clone had been there, and I
don't see why the need-to-compute-a-checksum case is any different. I
think we should view a flag like --clone as specifying how to copy a
file when we don't need to do anything but copy it. I don't think it
should dictate that we're not allowed to perform other processing when
that other processing is required.

From my point of view, this is not a case of incompatible options
having been specified. If you run pg_basebackup with both
--format=p and --format=t, those are incompatible options; the backup
can be done one way or the other, but not both at once. But here
there's no such conflict. Block-by-block copying and fast-copying can
happen as part of the same operation, as in the example that I showed,
where some files need the block-by-block copying and some can be
fast-copied. The user is entitled to specify which fast-copying method
they would like to have used for the files where fast-copying is
possible without getting a failure just because it isn't possible for
every single file.

Or to say it the other way around, if there's 1 file that needs to be
copied block by block and 99 files that can be fast-copied, you want
to force the user to the block-by-block method for all 100 files. I
want to force the use of the block-by-block method for the 1 file
where that's the only valid method, and let them choose what they want
to do for the other 99.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_upgrade --copy-file-range

From
Thomas Munro
Date:
Hmm, this discussion seems to assume that we only use
copy_file_range() to copy/clone whole segment files, right?  That's
great and may even get most of the available benefit given typical
databases with many segments of old data that never changes, but... I
think copy_file_range() allows us to go further than the other
whole-file clone techniques: we can stitch together parts of an old
backup segment file and an incremental backup to create a new file.
If you're interested in minimising disk use while also removing
dependencies on the preceding chain of backups, then it might make
sense to do that even if you *also* have to read the data to compute
the checksums, I think?  That's why I mentioned it: if
copy_file_range() (ie sub-file-level block sharing) is a solution in
search of a problem, has the world ever seen a better problem than
pg_combinebackup?



Re: pg_upgrade --copy-file-range

From
Robert Haas
Date:
On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Hmm, this discussion seems to assume that we only use
> copy_file_range() to copy/clone whole segment files, right?  That's
> great and may even get most of the available benefit given typical
> databases with many segments of old data that never changes, but... I
> think copy_file_range() allows us to go further than the other
> whole-file clone techniques: we can stitch together parts of an old
> backup segment file and an incremental backup to create a new file.
> If you're interested in minimising disk use while also removing
> dependencies on the preceding chain of backups, then it might make
> sense to do that even if you *also* have to read the data to compute
> the checksums, I think?  That's why I mentioned it: if
> copy_file_range() (ie sub-file-level block sharing) is a solution in
> search of a problem, has the world ever seen a better problem than
> pg_combinebackup?

That makes sense; it's just a different part of the code than I
thought we were talking about.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_upgrade --copy-file-range

From
Tomas Vondra
Date:

On 3/22/24 19:40, Robert Haas wrote:
> On Fri, Mar 22, 2024 at 1:22 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> Right, this will happen:
>>
>>   pg_combinebackup: error: unable to use accelerated copy when manifest
>>   checksums are to be calculated. Use --no-manifest
>>
>> Are you saying we should just silently override the copy method and do
>> the copy block by block?
> 
> Yes.
> 
>> I'm not strongly opposed to that, but it feels
>> wrong to just ignore that the user explicitly requested cloning, and I'm
>> not sure why this should be different from any other case where the user
>> requests an incompatible combination of options and/or options that are
>> not supported on the current configuration.
> 
> I don't feel like copying block-by-block when that's needed to compute
> a checksum is ignoring what the user requested. I mean, if we'd had to
> perform reconstruction rather than copying an entire file, we would
> have done that regardless of whether --clone had been there, and I
> don't see why the need-to-compute-a-checksum case is any different. I
> think we should view a flag like --clone as specifying how to copy a
> file when we don't need to do anything but copy it. I don't think it
> should dictate that we're not allowed to perform other processing when
> that other processing is required.
> 
> From my point of view, this is not a case of incompatible options
> having been specified. If you run pg_basebackup with both
> --format=p and --format=t, those are incompatible options; the backup
> can be done one way or the other, but not both at once. But here
> there's no such conflict. Block-by-block copying and fast-copying can
> happen as part of the same operation, as in the example that I showed,
> where some files need the block-by-block copying and some can be
> fast-copied. The user is entitled to specify which fast-copying method
> they would like to have used for the files where fast-copying is
> possible without getting a failure just because it isn't possible for
> every single file.
> 
> Or to say it the other way around, if there's 1 file that needs to be
> copied block by block and 99 files that can be fast-copied, you want
> to force the user to the block-by-block method for all 100 files. I
> want to force the use of the block-by-block method for the 1 file
> where that's the only valid method, and let them choose what they want
> to do for the other 99.
> 

OK, that makes sense. Here's a patch that should work like this - in
copy_file we check if we need to calculate checksums, and either use the
requested copy method, or fall back to the block-by-block copy.
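
The resulting decision logic is roughly the following fragment (names as
in the earlier sketches, illustrative only):

    /*
     * Inside copy_file(), before dispatching on copy_mode: computing a
     * checksum requires reading the data anyway, so quietly use the
     * block-by-block path, and apply the user's requested method only
     * to files that need nothing but copying.
     */
    if (checksum_ctx->type != CHECKSUM_TYPE_NONE)
        copy_mode = COPY_MODE_COPY;

    /* ... then the switch on copy_mode, as sketched earlier ... */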


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: pg_upgrade --copy-file-range

From
Tomas Vondra
Date:
On 3/23/24 13:38, Robert Haas wrote:
> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Hmm, this discussion seems to assume that we only use
>> copy_file_range() to copy/clone whole segment files, right?  That's
>> great and may even get most of the available benefit given typical
>> databases with many segments of old data that never changes, but... I
>> think copy_write_range() allows us to go further than the other
>> whole-file clone techniques: we can stitch together parts of an old
>> backup segment file and an incremental backup to create a new file.
>> If you're interested in minimising disk use while also removing
>> dependencies on the preceding chain of backups, then it might make
>> sense to do that even if you *also* have to read the data to compute
>> the checksums, I think?  That's why I mentioned it: if
>> copy_file_range() (ie sub-file-level block sharing) is a solution in
>> search of a problem, has the world ever seen a better problem than
>> pg_combinebackup?
> 
> That makes sense; it's just a different part of the code than I
> thought we were talking about.
> 

Yeah, that's in write_reconstructed_file() and the patch does not touch
that at all. I agree it would be nice to use copy_file_range() in this
part too, and it doesn't seem it'd be that hard to do, I think.

It seems we'd just need a "fork" that either calls pread/pwrite or
copy_file_range, depending on checksums and what was requested.
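
A hypothetical sketch of that per-block fork, loosely following
reconstruct.c's sourcemap/offsetmap structures (this is not the actual
patch):

static void
write_blocks_sketch(rfile **sourcemap, off_t *offsetmap,
                    unsigned block_length, int wfd,
                    CopyMode copy_mode, pg_checksum_context *checksum_ctx)
{
    for (unsigned i = 0; i < block_length; i++)
    {
        rfile  *s = sourcemap[i];

        if (copy_mode == COPY_MODE_COPY_FILE_RANGE &&
            checksum_ctx->type == CHECKSUM_TYPE_NONE)
        {
            /* fast path: the kernel copies/clones the block directly */
            if (copy_file_range(s->fd, &offsetmap[i], wfd, NULL,
                                BLCKSZ, 0) != BLCKSZ)
                pg_fatal("error while copying file range: %m");
        }
        else
        {
            uint8   buffer[BLCKSZ];

            /* slow path: the block must pass through userspace anyway */
            if (pread(s->fd, buffer, BLCKSZ, offsetmap[i]) != BLCKSZ)
                pg_fatal("could not read block %u", i);
            if (write(wfd, buffer, BLCKSZ) != BLCKSZ)
                pg_fatal("could not write block %u", i);
            pg_checksum_update(checksum_ctx, buffer, BLCKSZ);
        }
    }
}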

BTW is there a reason why the code calls "write" and not "pg_pwrite"?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_upgrade --copy-file-range

From
Tomas Vondra
Date:

On 3/23/24 14:47, Tomas Vondra wrote:
> On 3/23/24 13:38, Robert Haas wrote:
>> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>>> Hmm, this discussion seems to assume that we only use
>>> copy_file_range() to copy/clone whole segment files, right?  That's
>>> great and may even get most of the available benefit given typical
>>> databases with many segments of old data that never changes, but... I
>>> think copy_file_range() allows us to go further than the other
>>> whole-file clone techniques: we can stitch together parts of an old
>>> backup segment file and an incremental backup to create a new file.
>>> If you're interested in minimising disk use while also removing
>>> dependencies on the preceding chain of backups, then it might make
>>> sense to do that even if you *also* have to read the data to compute
>>> the checksums, I think?  That's why I mentioned it: if
>>> copy_file_range() (ie sub-file-level block sharing) is a solution in
>>> search of a problem, has the world ever seen a better problem than
>>> pg_combinebackup?
>>
>> That makes sense; it's just a different part of the code than I
>> thought we were talking about.
>>
> 
> Yeah, that's in write_reconstructed_file() and the patch does not touch
> that at all. I agree it would be nice to use copy_file_range() in this
> part too, and it doesn't seem it'd be that hard to do, I think.
> 
> It seems we'd just need a "fork" that either calls pread/pwrite or
> copy_file_range, depending on checksums and what was requested.
> 

Here's a patch to use copy_file_range in write_reconstructed_file too,
when requested/possible. One thing that I'm not sure about is whether to
do pg_fatal() if --copy-file-range but the platform does not support it.
This is more like what pg_upgrade does, but maybe we should just ignore
what the user requested and fall back to the regular copy (a bit like
when having to do a checksum for some files). Or maybe the check should
just happen earlier ...

I've been thinking about what Thomas wrote - that maybe it'd be good to
do copy_file_range() even when calculating the checksum, and I think he
may be right. But the current patch does not do that, and while it
doesn't seem very difficult to do (at least when reconstructing the file
from incremental backups) I don't have a very good intuition whether
it'd be a win or not in typical cases.

I have a naive question about the checksumming - if we used a
merkle-tree-like scheme, i.e. hashing the hashes of blocks rather than
the file contents directly, wouldn't that allow calculating the file
hash even without having to read all the blocks, making
copy_file_range more efficient? Sure, it's more complex, but it's a well
known scheme. (OK, now I realized it'd mean we can't use tools like
sha224sum to hash the files and compare to the manifest. I guess that's
why we don't do this ...)
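
Purely as an illustration of the idea (hypothetical, not anything
proposed in a patch): with per-block hashes recorded in the manifest,
the file-level hash could be computed from the stored block hashes
alone, e.g. using the existing checksum_helper API:

#include "common/checksum_helper.h"

static int
file_hash_from_block_hashes(const uint8 *block_hashes, size_t nblocks,
                            size_t hash_len, uint8 *output)
{
    pg_checksum_context ctx;

    /* hash the concatenated per-block hashes, not the file data itself */
    if (pg_checksum_init(&ctx, CHECKSUM_TYPE_SHA256) < 0)
        return -1;
    if (pg_checksum_update(&ctx, block_hashes, nblocks * hash_len) < 0)
        return -1;
    /* returns the length of the final hash, or -1 on error */
    return pg_checksum_final(&ctx, output);
}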


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: pg_upgrade --copy-file-range

From
Robert Haas
Date:
On Sat, Mar 23, 2024 at 9:37 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> OK, that makes sense. Here's a patch that should work like this - in
> copy_file we check if we need to calculate checksums, and either use the
> requested copy method, or fall back to the block-by-block copy.

+        Use efficient file cloning (also known as <quote>reflinks</quote> on
+        some systems) instead of copying files to the new cluster.  This can

new cluster -> output directory

I think your version kind of messes up the debug logging. In my
version, every call to copy_file() would emit either "would copy
\"%s\" to \"%s\" using strategy %s" or "copying \"%s\" to \"%s\"
using strategy %s". In your version, the dry_run mode emits a string
similar to the former, but creates separate translatable strings for
each copy method instead of using the same one with a different value
of %s. In non-dry-run mode, I think your version loses the debug
logging altogether.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_upgrade --copy-file-range

From
Robert Haas
Date:
On Sat, Mar 23, 2024 at 9:47 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> BTW is there a reason why the code calls "write" and not "pg_pwrite"?

I think it's mostly because I learned to code a really long time ago.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_upgrade --copy-file-range

From
Jakub Wartak
Date:
On Sat, Mar 23, 2024 at 6:57 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

> On 3/23/24 14:47, Tomas Vondra wrote:
> > On 3/23/24 13:38, Robert Haas wrote:
> >> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas.munro@gmail.com> wrote:
[..]
> > Yeah, that's in write_reconstructed_file() and the patch does not touch
> > that at all. I agree it would be nice to use copy_file_range() in this
> > part too, and it doesn't seem it'd be that hard to do, I think.
> >
> > It seems we'd just need a "fork" that either calls pread/pwrite or
> > copy_file_range, depending on checksums and what was requested.
> >
>
> Here's a patch to use copy_file_range in write_reconstructed_file too,
> when requested/possible. One thing that I'm not sure about is whether to
> do pg_fatal() if --copy-file-range but the platform does not support it.
[..]

Hi Tomas, so I gave the patches below a go today:
- v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch
- v20240323-2-0002-write_reconstructed_file.patch

My assessment:

v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch -
looks more or less good to go
v20240323-2-0002-write_reconstructed_file.patch - needs work; without
it, the good effects of clone/copy_file_range() are unlikely

Given Debian 12, a ~100GB DB (pgbench -i -s 7000, plus some additional
tables with GiST and GIN indexes, just to see more WAL record types),
and with backup sizes in MB like this:

106831  full
2823    incr.1 # captured after some time with pgbench -R 100
165     incr.2 # captured after some time with pgbench -R 100

Test cmd: rm -rf outtest; sync; sync ; sync; echo 3 | sudo tee
/proc/sys/vm/drop_caches ; time /usr/pgsql17/bin/pg_combinebackup -o
outtest full incr.1 incr.2

Test results of various copies on a small I/O constrained XFS device:
normal copy: 31m47.407s
--clone copy: error: file cloning not supported on this platform (it's
due to the #ifdef of having COPY_FILE_RANGE available)
--copy-file-range: aborted, as it was taking too long; I was
expecting it to accelerate, but it did not... obviously this is the
transparent failover in case of calculating checksums...
--manifest-checksums=NONE --copy-file-range: BUG, it keeps on appending
to just one file e.g. outtest/base/5/16427.29 with 200GB+ ?? and ended
up with ENOSPC [more on this later]
--manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 27m23.887s
--manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and
loop-fix: 5m1.986s but it creates corruption as it stands

Issues:

1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327
complains about win32/mingw:

[15:47:27.184] In file included from copy_file.c:22:
[15:47:27.184] copy_file.c: In function ‘copy_file’:
[15:47:27.184] ../../../src/include/common/logging.h:134:6: error:
this statement may fall through [-Werror=implicit-fallthrough=]
[15:47:27.184]   134 |   if (unlikely(__pg_log_level <= PG_LOG_DEBUG)) \
[15:47:27.184]       |      ^
[15:47:27.184] copy_file.c:96:5: note: in expansion of macro ‘pg_log_debug’
[15:47:27.184]    96 |     pg_log_debug("would copy \"%s\" to \"%s\"
(copy_file_range)",
[15:47:27.184]       |     ^~~~~~~~~~~~
[15:47:27.184] copy_file.c:99:4: note: here
[15:47:27.184]    99 |    case COPY_MODE_COPYFILE:
[15:47:27.184]       |    ^~~~
[15:47:27.184] cc1: all warnings being treated as errors

2. I do not know what's the consensus between --clone and
--copy-file-range , but if we have #ifdef FICLONE clone_works() #elif
HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also
apply the same logic to the --help so that --clone is not visible
there (for consistency?). Also the "error: file cloning not supported
on this platform " is technically incorrect, Linux does support
ioctl(FICLONE) and copy_file_range(), but we are just not choosing one
over another (so technically it is "available"). Nitpicking I know.

3. [v20240323-2-0002-write_reconstructed_file.patch]: The mentioned
ENOSPC spiral-of-death-bug symptoms are as follows:

strace:
copy_file_range(8, [697671680], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697679872], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697688064], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697696256], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697704448], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697712640], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697720832], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697729024], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697737216], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697745408], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697753600], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697761792], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697769984], 9, NULL, 8192, 0) = 8192

Notice that dest_off_t (poutoff) is NULL.

(gdb) where
#0  0x00007f2cd56f6733 in copy_file_range (infd=8,
pinoff=pinoff@entry=0x7f2cd53f54e8, outfd=outfd@entry=9,
poutoff=poutoff@entry=0x0,
    length=length@entry=8192, flags=flags@entry=0) at
../sysdeps/unix/sysv/linux/copy_file_range.c:28
#1  0x00005555ecd077f4 in write_reconstructed_file
(copy_mode=COPY_MODE_COPY_FILE_RANGE, dry_run=false, debug=true,
checksum_ctx=0x7ffc4cdb7700,
    offsetmap=<optimized out>, sourcemap=0x7f2cd54f6010,
block_length=<optimized out>, output_filename=0x7ffc4cdba910
"outtest/base/5/16427.29",
    input_filename=0x7ffc4cdba510
"incr.2/base/5/INCREMENTAL.16427.29") at reconstruct.c:648
#2  reconstruct_from_incremental_file
(input_filename=input_filename@entry=0x7ffc4cdba510
"incr.2/base/5/INCREMENTAL.16427.29",
    output_filename=output_filename@entry=0x7ffc4cdba910
"outtest/base/5/16427.29",
relative_path=relative_path@entry=0x7ffc4cdbc670 "base/5",
    bare_file_name=bare_file_name@entry=0x5555ee2056ef "16427.29",
n_prior_backups=n_prior_backups@entry=2,
    prior_backup_dirs=prior_backup_dirs@entry=0x7ffc4cdbf248,
manifests=0x5555ee137a10, manifest_path=0x7ffc4cdbad10
"base/5/16427.29",
    checksum_type=CHECKSUM_TYPE_NONE, checksum_length=0x7ffc4cdb9864,
checksum_payload=0x7ffc4cdb9868, debug=true, dry_run=false,
    copy_method=COPY_MODE_COPY_FILE_RANGE) at reconstruct.c:327

.. it's a spiral of death till ENOSPC. Reverting the
v20240323-2-0002-write_reconstructed_file.patch helps. The problem
lies in that do { ... } while (wb > 0) infinite loop (?) along with NULL
for the destination off_t. This seems to solve that thingy(?):

-                       do
-                       {
-                               wb = copy_file_range(s->fd,
&offsetmap[i], wfd, NULL, BLCKSZ, 0);
+                       //do
+                       //{
+                               wb = copy_file_range(s->fd,
&offsetmap[i], wfd, &offsetmap[i], BLCKSZ, 0);
                                if (wb < 0)
                                        pg_fatal("error while copying
file range from \"%s\" to \"%s\": %m",

input_filename, output_filename);
-                       } while (wb > 0);
+                       //} while (wb > 0);
 #else

...so that way I've got it down to 5mins.

4. .. but on startup I've got this after trying psql login: invalid
page in block 0 of relation base/5/1259. I've again reverted the
v20240323-2-0002 to see if that helped for the next round of
pg_combinebackup --manifest-checksums=NONE --copy-file-range, and after
32mins of waiting it did help indeed: I was able to log in, and select
counts worked and matched the data properly. I've reapplied the
v20240323-2-0002 (with my fix to prevent that endless loop) and the
issue was again(!) there. Probably it's related to the destination
offset. I couldn't find more time to look at it today and the setup
was big (100GB on a slow device), so just letting you know as fast as
possible.

5. More efficiency is on the table (optional patch note; just
for completeness; I don't think we have time for that?): even if
v20240323-2-0002 worked, the problem is that it would be issuing a
syscall for every 8kB. We seem to be performing lots of per-8kB
syscalls which hinder performance (both in copy_file_range and in
normal copy):

pread64(8, ""..., 8192, 369115136)      = 8192 // 369115136 + 8192 =
369123328 (matches next pread offset)
write(9, ""..., 8192)                   = 8192
pread64(8, ""..., 8192, 369123328)      = 8192 // 369123328 + 8192 = 369131520
write(9, ""..., 8192)                   = 8192
pread64(8, ""..., 8192, 369131520)      = 8192 // and so on
write(9, ""..., 8192)                   = 8192

Apparently there's no merging of adjacent I/Os, so pg_combinebackup
wastes lots of time issuing small syscalls where it could,
let's say, do a single pread/write (or even copy_file_range()). I think
it was not evident in my earlier testing (200GB; 39min vs ~40s) as I
had much smaller modifications in my incremental (think 99%
static data). A sketch of such merging is at the end of this mail.

6. I think we should change the subject with the new patch revision, so
that this functionality for incremental backups is not buried in
the pg_upgrade thread ;)
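
Here is the merging sketch promised under point 5 (hypothetical, not
part of any patch; names follow the earlier reconstruct.c fragments):
scan for runs of contiguous blocks coming from the same source file and
issue one syscall per run instead of one per 8kB block.

static void
write_merged_runs_sketch(rfile **sourcemap, off_t *offsetmap,
                         unsigned block_length, int wfd)
{
    unsigned    i = 0;

    while (i < block_length)
    {
        rfile      *s = sourcemap[i];
        off_t       start = offsetmap[i];
        unsigned    run = 1;
        size_t      remaining;

        /* extend the run while blocks stay contiguous in the same file */
        while (i + run < block_length &&
               sourcemap[i + run] == s &&
               offsetmap[i + run] == start + (off_t) run * BLCKSZ)
            run++;

        /* one syscall (or a few, on short copies) for the whole run */
        for (remaining = (size_t) run * BLCKSZ; remaining > 0;)
        {
            ssize_t wb = copy_file_range(s->fd, &start, wfd, NULL,
                                         remaining, 0);

            if (wb < 0)
                pg_fatal("error while copying file range: %m");
            if (wb == 0)
                pg_fatal("unexpected end of file in source backup");
            remaining -= wb;    /* "start" was advanced by the kernel */
        }

        i += run;
    }
}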

-J.



Re: pg_combinebackup --copy-file-range

From
Tomas Vondra
Date:

On 3/26/24 15:09, Jakub Wartak wrote:
> On Sat, Mar 23, 2024 at 6:57 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> 
>> On 3/23/24 14:47, Tomas Vondra wrote:
>>> On 3/23/24 13:38, Robert Haas wrote:
>>>> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> [..]
>>> Yeah, that's in write_reconstructed_file() and the patch does not touch
>>> that at all. I agree it would be nice to use copy_file_range() in this
>>> part too, and it doesn't seem it'd be that hard to do, I think.
>>>
>>> It seems we'd just need a "fork" that either calls pread/pwrite or
>>> copy_file_range, depending on checksums and what was requested.
>>>
>>
>> Here's a patch to use copy_file_range in write_reconstructed_file too,
>> when requested/possible. One thing that I'm not sure about is whether to
>> do pg_fatal() if --copy-file-range but the platform does not support it.
> [..]
> 
> Hi Tomas, so I gave a go to the below patches today:
> - v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch
> - v20240323-2-0002-write_reconstructed_file.patch
> 
> My assessment:
> 
> v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch -
> looks like more or less good to go

There's some issues with the --dry-run, pointed out by Robert. Should be
fixed in the attached version.

> v20240323-2-0002-write_reconstructed_file.patch - needs work and
> without that clone/copy_file_range() good effects are unlikely
> 
> Given Debian 12, ~100GB DB, (pgbench -i -s 7000 , and some additional
> tables with GiST and GIN indexes , just to see more WAL record types)
> and with backups sizes in MB like that:
> 
> 106831  full
> 2823    incr.1 # captured after some time with pgbench -R 100
> 165     incr.2 # captured after some time with pgbench -R 100
> 
> Test cmd: rm -rf outtest; sync; sync ; sync; echo 3 | sudo tee
> /proc/sys/vm/drop_caches ; time /usr/pgsql17/bin/pg_combinebackup -o
> outtest full incr.1 incr.2
> 
> Test results of various copies on a small I/O constrained XFS device:
> normal copy: 31m47.407s
> --clone copy: error: file cloning not supported on this platform (it's
> due to the #ifdef of having COPY_FILE_RANGE available)
> --copy-file-range: aborted, as it was taking too long; I was
> expecting it to accelerate, but it did not... obviously this is the
> transparent failover in case of calculating checksums...
> --manifest-checksums=NONE --copy-file-range: BUG, it keeps on appending
> to just one file e.g. outtest/base/5/16427.29 with 200GB+ ?? and ended
> up with ENOSPC [more on this later]

That's really strange.

> --manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 27m23.887s
> --manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and
> loop-fix: 5m1.986s but it creates corruption as it stands
> 

Thanks. I plan to do more similar tests, once my machines get done with
some other stuff.

> Issues:
> 
> 1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327
> complains about win32/mingw:
> 
> [15:47:27.184] In file included from copy_file.c:22:
> [15:47:27.184] copy_file.c: In function ‘copy_file’:
> [15:47:27.184] ../../../src/include/common/logging.h:134:6: error:
> this statement may fall through [-Werror=implicit-fallthrough=]
> [15:47:27.184]   134 |   if (unlikely(__pg_log_level <= PG_LOG_DEBUG)) \
> [15:47:27.184]       |      ^
> [15:47:27.184] copy_file.c:96:5: note: in expansion of macro ‘pg_log_debug’
> [15:47:27.184]    96 |     pg_log_debug("would copy \"%s\" to \"%s\"
> (copy_file_range)",
> [15:47:27.184]       |     ^~~~~~~~~~~~
> [15:47:27.184] copy_file.c:99:4: note: here
> [15:47:27.184]    99 |    case COPY_MODE_COPYFILE:
> [15:47:27.184]       |    ^~~~
> [15:47:27.184] cc1: all warnings being treated as errors
> 

Yup, missing break.

> 2. I do not know what's the consensus between --clone and
> --copy-file-range , but if we have #ifdef FICLONE clone_works() #elif
> HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also
> apply the same logic to the --help so that --clone is not visible
> there (for consistency?). Also the "error: file cloning not supported
> on this platform " is technically incorrect, Linux does support
> ioctl(FICLONE) and copy_file_range(), but we are just not choosing one
> over another (so technically it is "available"). Nitpicking I know.
> 

That's a good question, I'm not sure. But whatever we do, we should do
the same thing in pg_upgrade. Maybe there's some sort of precedent?

> 3. [v20240323-2-0002-write_reconstructed_file.patch]: The mentioned
> ENOSPC spiral-of-death-bug symptoms are as follows:
> 
> strace:
> copy_file_range(8, [697671680], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697679872], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697688064], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697696256], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697704448], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697712640], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697720832], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697729024], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697737216], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697745408], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697753600], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697761792], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697769984], 9, NULL, 8192, 0) = 8192
> 
> Notice that dest_off_t (poutoff) is NULL.
> 
> (gdb) where
> #0  0x00007f2cd56f6733 in copy_file_range (infd=8,
> pinoff=pinoff@entry=0x7f2cd53f54e8, outfd=outfd@entry=9,
> poutoff=poutoff@entry=0x0,
>     length=length@entry=8192, flags=flags@entry=0) at
> ../sysdeps/unix/sysv/linux/copy_file_range.c:28
> #1  0x00005555ecd077f4 in write_reconstructed_file
> (copy_mode=COPY_MODE_COPY_FILE_RANGE, dry_run=false, debug=true,
> checksum_ctx=0x7ffc4cdb7700,
>     offsetmap=<optimized out>, sourcemap=0x7f2cd54f6010,
> block_length=<optimized out>, output_filename=0x7ffc4cdba910
> "outtest/base/5/16427.29",
>     input_filename=0x7ffc4cdba510
> "incr.2/base/5/INCREMENTAL.16427.29") at reconstruct.c:648
> #2  reconstruct_from_incremental_file
> (input_filename=input_filename@entry=0x7ffc4cdba510
> "incr.2/base/5/INCREMENTAL.16427.29",
>     output_filename=output_filename@entry=0x7ffc4cdba910
> "outtest/base/5/16427.29",
> relative_path=relative_path@entry=0x7ffc4cdbc670 "base/5",
>     bare_file_name=bare_file_name@entry=0x5555ee2056ef "16427.29",
> n_prior_backups=n_prior_backups@entry=2,
>     prior_backup_dirs=prior_backup_dirs@entry=0x7ffc4cdbf248,
> manifests=0x5555ee137a10, manifest_path=0x7ffc4cdbad10
> "base/5/16427.29",
>     checksum_type=CHECKSUM_TYPE_NONE, checksum_length=0x7ffc4cdb9864,
> checksum_payload=0x7ffc4cdb9868, debug=true, dry_run=false,
>     copy_method=COPY_MODE_COPY_FILE_RANGE) at reconstruct.c:327
> 
> .. it's a spiral of death till ENOSPC. Reverting the
> v20240323-2-0002-write_reconstructed_file.patch helps. The problem
> lies in that do-while-wb-infinity-loop (?) along with NULL for the
> destination off_t. This seems to solve that thingy(?):
> 
> -                       do
> -                       {
> -                               wb = copy_file_range(s->fd,
> &offsetmap[i], wfd, NULL, BLCKSZ, 0);
> +                       //do
> +                       //{
> +                               wb = copy_file_range(s->fd,
> &offsetmap[i], wfd, &offsetmap[i], BLCKSZ, 0);
>                                 if (wb < 0)
>                                         pg_fatal("error while copying
> file range from \"%s\" to \"%s\": %m",
> 
> input_filename, output_filename);
> -                       } while (wb > 0);
> +                       //} while (wb > 0);
>  #else
> 
> ...so that way I've got it down to 5mins.
> 

Yeah, that retry logic is wrong. I ended up copying the check from the
"regular" copy branch, which simply bails out if copy_file_range returns
anything but the expected 8192.

I wonder if this should deal with partial writes, though. I mean,
copy_file_range() is allowed to copy only some of the requested bytes -
I don't know how often / in what situations that happens, though ...
And if we want to handle that for copy_file_range(), pwrite() needs the
same treatment.
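
If we do handle it, the loop would look roughly like this (just a
sketch, with a hypothetical helper name - not code from the patch):

    /*
     * Sketch of a partial-copy-aware loop: copy_file_range() may copy
     * fewer bytes than requested, so retry until the block is done.
     */
    static void
    copy_block(int srcfd, off_t *srcoff, int dstfd,
               const char *src_name, const char *dst_name)
    {
        size_t  remaining = BLCKSZ;

        while (remaining > 0)
        {
            ssize_t nbytes = copy_file_range(srcfd, srcoff, dstfd, NULL,
                                             remaining, 0);

            if (nbytes < 0)
                pg_fatal("error while copying file range from \"%s\" to \"%s\": %m",
                         src_name, dst_name);
            if (nbytes == 0)
                pg_fatal("unexpected end of file in \"%s\"", src_name);

            remaining -= nbytes;
        }
    }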

> 3. .. but on startup I've got this after trying psql login: invalid
> page in block 0 of relation base/5/1259. I've again reverted
> v20240323-2-0002 to see if that helped for the next round of
> pg_combinebackup --manifest-checksums=NONE --copy-file-range, and after
> 32mins of waiting it did help indeed: I was able to log in, and select
> counts worked and matched the data properly. I've reapplied
> v20240323-2-0002 (with my fix to prevent that endless loop) and the
> issue was there again(!). It's probably related to the destination
> offset. I couldn't find more time to look at it today and the setup
> was big (100GB on a slow device), so just letting you know as fast as
> possible.
> 

Can you see if you can still reproduce this with the attached version?

> 4. More efficiency is on the table (an optional patch idea; just
> for completeness; I don't think we have time for that?): even if
> v20240323-2-0002 worked, the problem is that it would be issuing a
> syscall for every 8kB. We seem to be performing lots of per-8kB
> syscalls which hinder performance (both in copy_file_range and in
> normal copy):
> 
> pread64(8, ""..., 8192, 369115136)      = 8192 // 369115136 + 8192 =
> 369123328 (matches next pread offset)
> write(9, ""..., 8192)                   = 8192
> pread64(8, ""..., 8192, 369123328)      = 8192 // 369123328 + 8192 = 369131520
> write(9, ""..., 8192)                   = 8192
> pread64(8, ""..., 8192, 369131520)      = 8192 // and so on
> write(9, ""..., 8192)                   = 8192
> 
> Apparently there's no merging of adjacent I/Os, so pg_combinebackup
> wastes lots of time issuing small syscalls where it could instead do,
> let's say, a single pread/write (or even copy_file_range()). I think
> it was not evident in my earlier testing (200GB; 39min vs ~40s) as I
> had much smaller modifications in my incremental (think of 99% of
> static data).
> 

Yes, I've been thinking about exactly this optimization, but I think
we're way past proposing this for PG17. The changes it would require
in reconstruct_from_incremental_file are way too significant. It has
to wait for PG18 ;-)

I do think there's more on the table, as mentioned by Thomas a couple
days ago - maybe we shouldn't approach clone/copy_file_range merely as
an optimization to save time, it might be entirely reasonable to do this
simply to allow the filesystem to do CoW magic and save space (even if
we need to read the data and recalculate the checksum, which now
disables these copy methods).


> 5. I think we should change the subject with the new patch revision, so
> that such functionality for incremental backups is not buried down in
> the pg_upgrade thread ;)
> 

OK.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: pg_combinebackup --copy-file-range

From
Jakub Wartak
Date:
On Tue, Mar 26, 2024 at 7:03 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
[..]
>
> That's really strange.

Hi Tomas, but it looks like it's fixed now :)

> > --manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 27m23.887s
> > --manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and
> > loop-fix: 5m1.986s but it creates corruption as it stands
> >
>
> Thanks. I plan to do more similar tests, once my machines get done with
> some other stuff.

Please do so as I do not trust my fingers :-)

> > Issues:
> >
> > 1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327
> > complains about win32/mingw:
> >
[..]
> >
>
> Yup, missing break.

Now it's https://cirrus-ci.com/task/4997185324974080?logs=headers_headerscheck#L10
, reproducible with "make -s headerscheck
EXTRAFLAGS='-fmax-errors=10'":

/tmp/postgres/src/bin/pg_combinebackup/reconstruct.h:34:91: error:
unknown type name ‘CopyMode’ / CopyMode copy_mode);

To me it looks like reconstruct.h needs to include the definition of
CopyMode, which lives in copy_file.h.
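
I.e. roughly this near the top of reconstruct.h (just a sketch):

    /* reconstruct.h - sketch of the fix */
    #include "copy_file.h"      /* for CopyMode */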

> > 2. I do not know what's the consensus between --clone and
> > --copy-file-range , but if we have #ifdef FICLONE clone_works() #elif
> > HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also
> > apply the same logic to the --help so that --clone is not visible
> > there (for consistency?). Also the "error: file cloning not supported
> > on this platform " is technically incorrect, Linux does support
> > ioctl(FICLONE) and copy_file_range(), but we are just not choosing one
> > over another (so technically it is "available"). Nitpicking I know.
> >
>
> That's a good question, I'm not sure. But whatever we do, we should do
> the same thing in pg_upgrade. Maybe there's some sort of precedent?

Sigh, you are right... It's consistency hell.

> > 3. [v20240323-2-0002-write_reconstructed_file.patch]: The mentioned
> > ENOSPC spiral-of-death bug symptoms look like this:
[..]
>
> Yeah, that retry logic is wrong. I ended up copying the check from the
> "regular" copy branch, which simply bails out if copy_file_range returns
> anything but the expected 8192.
>
> I wonder if this should deal with partial writes, though. I mean,
> copy_file_range() is allowed to copy only some of the requested bytes -
> I don't know how often / in what situations that happens, though ...
> And if we want to handle that for copy_file_range(), pwrite() needs the
> same treatment.

Maybe this helps:
https://github.com/coreutils/coreutils/blob/606f54d157c3d9d558bdbe41da8d108993d86aeb/src/copy.c#L1427
- it's harder than I anticipated (we can ignore the sparse logic
though, I think).

> > 3. .. but on startup I've got this after trying psql login: invalid
> > page in block 0 of relation base/5/1259.
[..]
>
> Can you see if you can still reproduce this with the attached version?

Looks like it's fixed now and it works great (~3min, multiple times)!

BTW: I've also tried it over NFSv4 over loopback with XFS (since
copy_file_range() probably does server-side optimization there), but
somehow it was so slow that it is close to being unusable (~9GB out of
104GB reconstructed after 45mins) - maybe it's due to NFS mount opts; I
don't think we should worry too much. I think it's related to missing
the below optimization, if that matters. I think it's too early to warn
users about NFS (I've spent just 10 mins on it), but on the other hand
people might complain it's broken...

> > Apparently there's no merging of adjacent I/Os, so pg_combinebackup
> > wastes lots of time issuing small syscalls where it could instead do,
> > let's say, a single pread/write (or even copy_file_range()). I think
> > it was not evident in my earlier testing (200GB; 39min vs ~40s) as I
> > had much smaller modifications in my incremental (think of 99% of
> > static data).
> >
>
> Yes, I've been thinking about exactly this optimization, but I think
> we're way past proposing this for PG17. The changes it would require
> in reconstruct_from_incremental_file are way too significant. It has
> to wait for PG18 ;-)

Sure thing!

> I do think there's more on the table, as mentioned by Thomas a couple
> days ago - maybe we shouldn't approach clone/copy_file_range merely as
> an optimization to save time, it might be entirely reasonable to do this
> simply to allow the filesystem to do CoW magic and save space (even if
> we need to read the data and recalculate the checksum, which now
> disables these copy methods).

Sure! I think time will still be a priority though, as
pg_combinebackup duration impacts RTO while disk space is relatively
cheap.

One could argue that reconstructing 50TB will be a challenge though.
Now my tests indicate space saving is already happening with the 0002
patch - the 100GB DB / full backup stats look like this (so I think we
are good when using CoW - not so without CoW) -- or did I misunderstand
something?:

root@jw-test-1:/backups# du -sm /backups/
214612  /backups/
root@jw-test-1:/backups# du -sm *
106831  full
2823    incr.1
165     incr.2
104794  outtest
root@jw-test-1:/backups# df -h . # note this double confirms that just
114GB is used (XFS), great!
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       500G  114G  387G  23% /backups
root@jw-test-1:/backups# # https://github.com/pwaller/sharedextents
root@jw-test-1:/backups# ./sharedextents-linux-amd64
full/base/5/16427.68 outtest/base/5/16427.68
1056915456 / 1073741824 bytes (98.43%) # extents reuse

Now I was wondering a little bit if the huge number of XFS extents
won't hurt read performance (they were probably created due to the many
independent copy_file_range() calls):

root@jw-test-1:/backups# filefrag full/base/5/16427.68
full/base/5/16427.68: 1 extent found
root@jw-test-1:/backups# filefrag outtest/base/5/16427.68
outtest/base/5/16427.68: 3979 extents found

However, at first look, sequential reads of such a CoW file are still
good (I'm assuming such a backup would, after reconstruction, be copied
back from this backup server to the proper DB server):

root@jw-test-1:/backups# echo 3 > /proc/sys/vm/drop_caches
root@jw-test-1:/backups# time cat outtest/base/5/16427.68 > /dev/null
real    0m4.286s
root@jw-test-1:/backups# echo 3 > /proc/sys/vm/drop_caches
root@jw-test-1:/backups# time cat full/base/5/16427.68 > /dev/null
real    0m4.325s

Now, Thomas wrote there "then it might make sense to do that even if
you *also* have to read the data to compute the checksums, I think?"
... that sounds like: read(), checksum(), and still do
copy_file_range() instead of pwrite()? PG 18? :D

-J.



Re: pg_upgrade --copy-file-range

From
Robert Haas
Date:
On Tue, Mar 26, 2024 at 2:09 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> The patch I shared a couple minutes ago should fix this, effectively
> restoring the original debug behavior. I liked the approach with calling
> strategy_implementation a bit more, I wonder if it'd be better to go
> back to that for the "accelerated" copy methods, somehow.

Somehow I don't see this patch?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_upgrade --copy-file-range

From
Tomas Vondra
Date:

On 3/28/24 21:45, Robert Haas wrote:
> On Tue, Mar 26, 2024 at 2:09 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> The patch I shared a couple minutes ago should fix this, effectively
>> restoring the original debug behavior. I liked the approach with calling
>> strategy_implementation a bit more, I wonder if it'd be better to go
>> back to that for the "accelerated" copy methods, somehow.
> 
> Somehow I don't see this patch?
> 

It's here:

https://www.postgresql.org/message-id/90866c27-265a-4adb-89d0-18c8dd22bc19%40enterprisedb.com

I did change the subject to reflect that it's no longer about
pg_upgrade, maybe that breaks the threading for you somehow?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_combinebackup --copy-file-range

From
Robert Haas
Date:
On Tue, Mar 26, 2024 at 2:03 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> [ new patches ]

Tomas, thanks for pointing me to this email; as you speculated, gmail
breaks threading if the subject line changes.

The documentation still needs work here:

- It refers to --link mode, which is not a thing.

- It should talk about the fact that in some cases block-by-block
copying will still be needed, possibly mentioning that it specifically
happens when the old backup manifest is not available or does not
contain checksums or does not contain checksums of the right type, or
maybe just being a bit vague.

In copy_file.c:

- You added an unnecessary blank line to the beginning of a comment block.

- You could keep the strategy_implementation variable here. I don't
think that's 100% necessary, but now that you've got the code
structured this way, there's no compelling reason to remove it.

- I don't know what the +/* XXX maybe this should do the check
internally, same as the other functions? */ comment is talking about.

- Maybe these functions should have header comments.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_combinebackup --copy-file-range

From
Tomas Vondra
Date:
On 3/29/24 15:23, Robert Haas wrote:
> On Tue, Mar 26, 2024 at 2:03 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> [ new patches ]
> 
> Tomas, thanks for pointing me to this email; as you speculated, gmail
> breaks threading if the subject line changes.
> 
> The documentation still needs work here:
> 
> - It refers to --link mode, which is not a thing.
> 
> - It should talk about the fact that in some cases block-by-block
> copying will still be needed, possibly mentioning that it specifically
> happens when the old backup manifest is not available or does not
> contain checksums or does not contain checksums of the right type, or
> maybe just being a bit vague.
> 
> In copy_file.c:
> 
> - You added an unnecessary blank line to the beginning of a comment block.
> 

Thanks, should be all cleaned up now, I think.

> - You could keep the strategy_implementation variable here. I don't
> think that's 100% necessary, but now that you've got the code
> structured this way, there's no compelling reason to remove it.
> 

Yeah, I think you're right. The strategy_implementation seemed a bit
weird to me because we now have 4 functions with different signatures.
Most only take src/dst, but copy_file_blocks() also takes a checksum. And
it seemed better to handle everything the same way, rather than treating
copy_file_blocks as an exception.

But it's not that bad, so 0001 has strategy_implementation again. But
I'll get back to this in a minute.

> - I don't know what the +/* XXX maybe this should do the check
> internally, same as the other functions? */ comment is talking about.
> 

I think this is stale. The XXX is about how the various functions
detect/report support. In most of them we have the ifdefs/pg_fatal() inside the
function, but CopyFile() has nothing like that, because the detection
happens earlier. I wasn't sure if maybe we should make all these
functions more alike, but I don't think we should.

> - Maybe these functions should have header comments.
> 

Right, added.


I was thinking about the comment [1] from a couple of days ago, where
Thomas suggested that maybe we should try doing the CoW stuff
(clone/copy_file_range) even in cases when we need to read the block,
say to calculate checksum, or even reconstruct from incremental backups.

I wasn't sure how big the benefits of the patches shared so far might
be, so I decided to do some tests. I did a fairly simple thing:

1) initialize a cluster with pgbench scale 5000 (~75GB)

2) create a full backup

3) do a run that updates ~1%, 10% and 20% of the blocks

4) create an incremental backup after each run

5) do pg_combinebackup for each of the increments, with
block-by-block copy and copy_file_range, measure how long it takes and
how much disk space it consumes

I did this on xfs and btrfs, and it quickly became obvious that there's
very little benefit unless --no-manifest is used. Which makes perfect
sense, because pgbench does uniform updates, so all segments need to be
reconstructed from increments (so copy_file.c is mostly irrelevant), and
write_reconstructed_file only uses copy_file_range() when there are no
checksums to calculate.

I don't know how common --no-manifest is going to be, but I guess people
will want to keep manifests in at least some backup schemes (e.g. to
rebuild a full backup instead of having to take one regularly).

So I decided to take a stab at Thomas' idea, i.e. reading the data to
calculate checksums, but then using copy_file_range instead of just
writing the data onto disk. This is in 0003, which relaxes the
conditions in 0002 shared a couple days ago. And this helped a lot.
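
The core of it is as simple as this (a simplified sketch with
hypothetical error messages, not the actual patch code; partial copies
are ignored, see the earlier loop sketch):

    /*
     * Sketch: read the block only to feed the checksum, then write it
     * by cloning the range so the filesystem can share the blocks.
     */
    uint8   buffer[BLCKSZ];
    off_t   off = offsetmap[i];

    if (pread(s->fd, buffer, BLCKSZ, off) != BLCKSZ)
        pg_fatal("could not read block %d in file \"%s\": %m",
                 i, input_filename);

    if (checksum_ctx->type != CHECKSUM_TYPE_NONE)
        pg_checksum_update(checksum_ctx, buffer, BLCKSZ);

    if (copy_file_range(s->fd, &off, wfd, NULL, BLCKSZ, 0) != BLCKSZ)
        pg_fatal("could not copy block %d to file \"%s\": %m",
                 i, output_filename);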

The attached PDF shows results for xfs/btrfs. Charts on the left are
disk space occupied by the reconstructed backup, measured as the
difference between "df" before and after running pg_combinebackup. The
duration of the pg_combinebackup execution is on the right. The first
row is without manifest (i.e. --no-manifest), the second row is with
manifest.

The 1%, 10% and 20% groups are for the various increments, updating
different fractions of the database.

The database is ~75GB, so that's what we expect a plain copy to use. If
there are some CoW benefits from copy_file_range, allowing the fs to
reuse some of the space, the disk space should be reduced. And
similarly, there could/should be some improvement in the
pg_combinebackup duration.

Each bar is a different copy method and patch:

* copy on master/0001/0002/0003 - we don't expect any difference between
these, it should all perform the same and use the "full" space

* copy_file_range on 0001/0002/0003 - 0001 should perform the same as
copy, because it's only about full-segment copies and we don't have any
of those here; 0002/0003 should help, depending on --no-manifest

And indeed, this is what we see. 0002/0003 use only a fraction of disk
space, roughly the same as the updated fraction (which matches the size
of the increment). This is nice.

For duration, the benefits seem to depend on the file system. For btrfs
it actually is faster, as expected. 0002/0003 saves maybe 30-50% of
time, compared to block-by-block copy. On XFS it's not that great, the
copy_file_range is actually slower by up to about 50%. And this is not
about the extra read - this affects the 0002/no-manifest case too, where
the read is not necessary.

I think this is fine. It's a tradeoff, where on some filesystems you can
save time or space, and on other filesystems you can save both. That's
for the users to decide, I think.

I'll see how this works on EXT4/ZFS next ...

But thinking about this a bit more, I realized there's no reason not to
apply the same logic to the copy_file part. I mean, if we need to copy a
file but also calculate a checksum, we can simply do the clone using
clone/copy_file_range, but then also read the file and calculate the
checksum ...

0004 does this, by simply passing the checksum_ctx around, which also
has the nice consequence that all the functions now have the same
signature, which makes the strategy_implementation work for all cases. I
need to do more testing of this, but I like how this looks.

Of course, maybe there's no agreement that this is the right way to
approach this, and we should do the regular block-by-block copy?

There's one more change in 0003 - the checks whether clone and
copy_file_range are supported by the platform now happen right at the
beginning, when parsing the arguments, so that when a user specifies one
of those options, the error happens right away instead of sometime later
when we happen to hit one of those pg_fatal() places.

I think this is the right place to do these checks, as it makes
write_reconstructed_file much easier to understand (without all the
ifdefs etc.).
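
Conceptually it's just this, right after the argument parsing (a sketch
only - the names come from the patch context, the exact code differs):

    /* sketch: fail fast when the requested copy method can't work here */
    #ifndef HAVE_COPY_FILE_RANGE
        if (opt.copy_method == COPY_MODE_COPY_FILE_RANGE)
            pg_fatal("copy_file_range is not supported on this platform");
    #endif
    #if !defined(FICLONE) && \
        !(defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE))
        if (opt.copy_method == COPY_MODE_CLONE)
            pg_fatal("file cloning is not supported on this platform");
    #endif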

But there's an argument to be had about whether this should fail with
pg_fatal() or just fall back to the default copy method.

BTW I wonder if it makes sense to only allow one of those methods? At
the moment the user can specify both --clone and --copy-file-range, and
which one "wins" depends on the order in which they are specified. Seems
confusing at best. But maybe it'd make sense to allow both, and e.g. use
clone() to copy whole segments and copy_file_range() for other places?


regards


[1]
https://www.postgresql.org/message-id/CA%2BhUKG%2B8KDk%2BpM6vZHWT6XtZzh-sdieUDohcjj0fia6aqK3Oxg%40mail.gmail.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: pg_combinebackup --copy-file-range

From
Thomas Munro
Date:
On Sun, Mar 31, 2024 at 1:37 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> So I decided to take a stab at Thomas' idea, i.e. reading the data to
> ...
> I'll see how this works on EXT4/ZFS next ...

Wow, very cool!  A couple of very quick thoughts/notes:

ZFS: the open source version only gained per-file block cloning in
2.2, so if you're on an older release I expect copy_file_range() to
work but not really do the magic thing.  On the FreeBSD version you
also have to turn cloning on with a sysctl because people were worried
about bugs in early versions so by default you still get actual
copying, not sure if you need something like that on the Linux
version...  (Obviously ZFS is always completely COW-based, but before
the new block cloning stuff it could only share blocks by its own
magic deduplication if enabled, or by cloning/snapshotting a whole
dataset/mountpoint; there wasn't a way to control it explicitly like
this.)

Alignment: block sharing on any fs requires it.  I haven't re-checked
recently but IIRC the incremental file format might have a
non-block-sized header?  That means that if you copy_file_range() from
both the older backup and also the incremental backup, only the former
will share blocks, and the latter will silently be done by copying to
newly allocated blocks.  If that's still true, I'm not sure how hard
it would be to tweak the format to inject some padding and to make
sure that there isn't any extra header before each block.



Re: pg_combinebackup --copy-file-range

From
Thomas Munro
Date:
+            wb = copy_file_range(s->fd, &offsetmap[i], wfd, NULL, BLCKSZ, 0);

Can you collect adjacent blocks in one multi-block call?  And then I
think the contract is that you need to loop if it returns short.



Re: pg_combinebackup --copy-file-range

From
Tomas Vondra
Date:
On 3/31/24 03:03, Thomas Munro wrote:
> On Sun, Mar 31, 2024 at 1:37 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> So I decided to take a stab at Thomas' idea, i.e. reading the data to
>> ...
>> I'll see how this works on EXT4/ZFS next ...
> 
> Wow, very cool!  A couple of very quick thoughts/notes:
> 
> ZFS: the open source version only gained per-file block cloning in
> 2.2, so if you're on an older release I expect copy_file_range() to
> work but not really do the magic thing.  On the FreeBSD version you
> also have to turn cloning on with a sysctl because people were worried
> about bugs in early versions so by default you still get actual
> copying, not sure if you need something like that on the Linux
> version...  (Obviously ZFS is always completely COW-based, but before
> the new block cloning stuff it could only share blocks by its own
> magic deduplication if enabled, or by cloning/snapshotting a whole
> dataset/mountpoint; there wasn't a way to control it explicitly like
> this.)
> 

I'm on 2.2.2 (on Linux). But there's something wrong, because the
pg_combinebackup that took ~150s on xfs/btrfs, takes ~900s on ZFS.

I'm not sure it's a ZFS config issue, though, because it's not CPU or
I/O bound, and I see this on both machines. And some simple dd tests
show the zpool can do 10x the throughput. Could this be due to the file
header / pool alignment?

> Alignment: block sharing on any fs requires it.  I haven't re-checked
> recently but IIRC the incremental file format might have a
> non-block-sized header?  That means that if you copy_file_range() from
> both the older backup and also the incremental backup, only the former
> will share blocks, and the latter will silently be done by copying to
> newly allocated blocks.  If that's still true, I'm not sure how hard
> it would be to tweak the format to inject some padding and to make
> sure that there isn't any extra header before each block.

I admit I'm not very familiar with the format, but you're probably right
there's a header, and header_length does not seem to consider alignment.
make_incremental_rfile simply does this:

    /* Remember length of header. */
    rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
        sizeof(rf->truncation_block_length) +
        sizeof(BlockNumber) * rf->num_blocks;

and sendFile() does the same thing when creating an incremental basebackup.
I guess it wouldn't be too difficult to make sure to align this to
BLCKSZ or something like this. I wonder if the file format is documented
somewhere ... It'd certainly be nicer to tweak before v18, if necessary.
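
Roughly something like this, I suppose (an untested sketch, using the
existing TYPEALIGN macro):

    /* pad the header so the block data starts at a BLCKSZ boundary */
    rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
        sizeof(rf->truncation_block_length) +
        sizeof(BlockNumber) * rf->num_blocks;

    /* round up to a multiple of BLCKSZ (only needed if there are blocks) */
    if (rf->num_blocks > 0)
        rf->header_length = TYPEALIGN(BLCKSZ, rf->header_length);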

Anyway, is that really a problem? I mean, in my tests the CoW stuff
seemed to work quite fine - at least on the XFS/BTRFS. Although, maybe
that's why it took longer on XFS ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_combinebackup --copy-file-range

From
Thomas Munro
Date:
On Sun, Mar 31, 2024 at 5:33 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> I'm on 2.2.2 (on Linux). But there's something wrong, because the
> pg_combinebackup that took ~150s on xfs/btrfs, takes ~900s on ZFS.
>
> I'm not sure it's a ZFS config issue, though, because it's not CPU or
> I/O bound, and I see this on both machines. And some simple dd tests
> show the zpool can do 10x the throughput. Could this be due to the file
> header / pool alignment?

Could ZFS recordsize > 8kB be making it worse, repeatedly dealing with
the same 128kB record as you copy_file_range 16 x 8kB blocks?
(Guessing you might be using the default recordsize?)

> I admit I'm not very familiar with the format, but you're probably right
> there's a header, and header_length does not seem to consider alignment.
> make_incremental_rfile simply does this:
>
>     /* Remember length of header. */
>     rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
>         sizeof(rf->truncation_block_length) +
>         sizeof(BlockNumber) * rf->num_blocks;
>
> and sendFile() does the same thing when creating incremental basebackup.
> I guess it wouldn't be too difficult to make sure to align this to
> BLCKSZ or something like this. I wonder if the file format is documented
> somewhere ... It'd certainly be nicer to tweak before v18, if necessary.
>
> Anyway, is that really a problem? I mean, in my tests the CoW stuff
> seemed to work quite fine - at least on the XFS/BTRFS. Although, maybe
> that's why it took longer on XFS ...

Yeah I'm not sure, I assume it did more allocating and copying because
of that.  It doesn't matter and it would be fine if a first version
weren't as good as possible, and fine if we tune the format later once
we know more, ie leaving improvements on the table.  I just wanted to
share the observation.  I wouldn't be surprised if the block-at-a-time
coding makes it slower and maybe makes the on disk data structures
worse, but I dunno I'm just guessing.

It's also interesting but not required to figure out how to tune ZFS
well for this purpose right now...



Re: pg_combinebackup --copy-file-range

From
Tomas Vondra
Date:

On 3/31/24 06:46, Thomas Munro wrote:
> On Sun, Mar 31, 2024 at 5:33 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> I'm on 2.2.2 (on Linux). But there's something wrong, because the
>> pg_combinebackup that took ~150s on xfs/btrfs, takes ~900s on ZFS.
>>
>> I'm not sure it's a ZFS config issue, though, because it's not CPU or
>> I/O bound, and I see this on both machines. And some simple dd tests
>> show the zpool can do 10x the throughput. Could this be due to the file
>> header / pool alignment?
> 
> Could ZFS recordsize > 8kB be making it worse, repeatedly dealing with
> the same 128kB record as you copy_file_range 16 x 8kB blocks?
> (Guessing you might be using the default recordsize?)
> 

No, I reduced the record size to 8kB. And the pgbench init takes about
the same as on other filesystems on this hardware, I think. ~10 minutes
for scale 5000.

>> I admit I'm not very familiar with the format, but you're probably right
>> there's a header, and header_length does not seem to consider alignment.
>> make_incremental_rfile simply does this:
>>
>>     /* Remember length of header. */
>>     rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
>>         sizeof(rf->truncation_block_length) +
>>         sizeof(BlockNumber) * rf->num_blocks;
>>
>> and sendFile() does the same thing when creating incremental basebackup.
>> I guess it wouldn't be too difficult to make sure to align this to
>> BLCKSZ or something like this. I wonder if the file format is documented
>> somewhere ... It'd certainly be nicer to tweak before v18, if necessary.
>>
>> Anyway, is that really a problem? I mean, in my tests the CoW stuff
>> seemed to work quite fine - at least on the XFS/BTRFS. Although, maybe
>> that's why it took longer on XFS ...
> 
> Yeah I'm not sure, I assume it did more allocating and copying because
> of that.  It doesn't matter and it would be fine if a first version
> weren't as good as possible, and fine if we tune the format later once
> we know more, ie leaving improvements on the table.  I just wanted to
> share the observation.  I wouldn't be surprised if the block-at-a-time
> coding makes it slower and maybe makes the on disk data structures
> worse, but I dunno I'm just guessing.
> 
> It's also interesting but not required to figure out how to tune ZFS
> well for this purpose right now...

No idea. Any idea if there are some good ZFS statistics to check?


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_combinebackup --copy-file-range

From
Tomas Vondra
Date:
Hi,

I've been running some benchmarks and experimenting with various stuff,
trying to improve the poor performance on ZFS, and the regression on XFS
when using copy_file_range. And oh boy, did I find interesting stuff ...

Attached is a PDF with results of my benchmark for ZFS/XFS/BTRFS, on my
two machines. I already briefly described what the benchmark does, but
to clarify:

1) generate backups: initialize pgbench scale 5000, do full backup,
update roughly 1%, 10% and 20% blocks and do an incremental backup after
each of those steps

2) combine backups: full + 1%, full + 1% + 10%, full + 1% + 10% + 20%

3) measure how long it takes and how much more disk space is used (to
see how well the CoW stuff works)

4) after each pg_combinebackup, run pg_verifybackup, start the cluster
to finish recovery, and run pg_checksums --check (to check the patches
don't produce something broken)

There's a lot of interesting stuff to discuss, some of which was already
mentioned in this thread earlier - in particular, I want to talk about
block alignment, prefetching and processing larger chunks of blocks.

Also attached are all the patches (including the ugly WIP parts
discussed later), the complete results if you want to do your own
analysis, and the scripts used to generate/restore the backups.

FWIW I'm not claiming the patches are commit-ready (especially the new
WIP parts), but they should be correct and good enough for discussion
(that applies especially to 0007). I think I could get them ready in a
day or two, but I'd like some feedback on my findings, and also, if
someone has objections to getting this in so shortly before the feature
freeze, I'd prefer to know about that.

The patches are numbered the same as in the benchmark results, i.e. 0001
is "1", 0002 is "2" etc. The "0-baseline" option is current master
without any patches.


Now to the findings ....


1) block alignment
------------------

This was mentioned by Thomas a couple of days ago, when he pointed out
that the incremental files have a variable-length header (to record
which blocks are stored in the file), followed by the block data, which
means the block data is not aligned to the fs block size. I hadn't
realized this; I just used whatever the reconstruction function
received, but Thomas pointed out this may interfere with CoW, which
needs the blocks to be aligned.

And I think he's right, and my tests confirm this. I did a trivial patch
to align the blocks to 8K boundary, by forcing the header to be a
multiple of 8K (I think 4K alignment would be enough). See the 0001
patch that does this.

And if I measure the disk space used by pg_combinebackup, and compare
the results with results without the patch ([1] from a couple days
back), I see this:

   pct      not aligned        aligned
  -------------------------------------
    1%             689M            19M
   10%            3172M            22M
   20%           13797M            27M

Yes, those numbers are correct. I didn't believe this at first, but the
backups are valid/verified, checksums are OK, etc. BTRFS has similar
numbers (e.g. drop from 20GB to 600MB).

If you look at the charts in the PDF, charts for on-disk space are on
the right side. It might seem like copy_file_range/CoW has no impact,
but that's just an illusion - the bars for the last three cases are so
small it's difficult to see them (especially on XFS). While this does
not show the impact of alignment (because all of the cases in these runs
have blocks aligned), it shows how tiny the backups can be made. But
alignment does have a significant impact, per the preceding paragraph.

This also affects the prefetching, which I'm going to talk about next. But
having the blocks misaligned (spanning multiple 4K pages) forces the
system to prefetch more pages than necessary. I don't know how big the
impact is, because the prefetch patch is 0002, so I only have results
for prefetching on aligned blocks, but I don't see how it could not have
a cost.

I do think we should just align the blocks properly. The 0001 patch does
that simply by adding a bunch of \0 bytes up to the next 8K boundary.
Yes, this has a cost - if you have tiny files with only one or two
blocks changed, the increment file will be a bit larger. Files without
any blocks don't need alignment/padding, and as the number of blocks
increases, it gets negligible pretty quickly. Also, files use a multiple
of fs blocks anyway, so if we align to 4K blocks it wouldn't actually
need more space at all. And even if it does, it's all \0, so pretty damn
compressible (and I'm sorry, but if you care about tiny amounts of data
added by alignment, but refuse to use compression ...).

I think we absolutely need to align the blocks in the incremental files,
and I think we should do that now. I think 8K would work, but maybe we
should add an alignment parameter to basebackup & manifest?

The reason why I think maybe this should be a basebackup parameter is
the recent discussion about large fs blocks - it seems to be in the
works, so maybe better to be ready and not assume all fs have 4K.

And I think we probably want to do this now, because this affects all
tools dealing with incremental backups - even if someone writes a custom
version of pg_combinebackup, it will have to deal with misaligned data.
Perhaps there might be something like pg_basebackup that "transforms"
the data received from the server (and also the backup manifest), but
that does not seem like a great direction.

Note: Of course, these space savings only exist thanks to sharing blocks
with the input backups, because the blocks in the combined backup point
to one of the other backups. If those old backups are removed, then the
"saved space" disappears because there's only a single copy.


2) prefetch
-----------

I was very puzzled by the awful performance on ZFS. When every other fs
(EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took
900-1000 seconds on ZFS, no matter what I did. I tried all the tuning
advice I could think of, with almost no effect.

Ultimately I decided that it probably is the "no readahead" behavior
I've observed on ZFS. I assume it's because it doesn't use the page
cache where the regular readahead is detected etc. And there's no
prefetching in pg_combinebackup, so I decided to do an experiment and added
a trivial explicit prefetch when reconstructing the file - every time
we'd read data from a file, we do posix_fadvise for up to 128 blocks
ahead (similar to what bitmap heap scan code does). See 0002.
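
The gist of it (a simplified sketch, not the actual patch code - the
names match the write_reconstructed_file context):

    /*
     * Before reading block i, advise the kernel about the upcoming
     * blocks that come from the same source file.
     */
    #define PREFETCH_BLOCKS 128

    for (unsigned int j = i; j < i + PREFETCH_BLOCKS && j < block_length; j++)
    {
        /* only prefetch blocks coming from the same source file */
        if (sourcemap[j] != s)
            continue;

        (void) posix_fadvise(s->fd, offsetmap[j], BLCKSZ,
                             POSIX_FADV_WILLNEED);
    }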

And tadaaa - the duration dropped from 900-1000 seconds to only about
250-300 seconds, so an improvement of a factor of 3-4x. I think this is
pretty massive.

There's a couple more interesting ZFS details - the prefetching seems to
be necessary even when using copy_file_range() and we don't need to read
the data (to calculate checksums). This is why the "manifest=off" chart
has the strange group of high bars at the end - the copy cases are fast
because prefetch happens, but if we switch to copy_file_range() there
are no prefetches and it gets slow.

This is a bit bizarre, especially because the manifest=on cases are
still fast, exactly because the pread + prefetching still happens. I'm
sure users would find this puzzling.

Unfortunately, the prefetching is not beneficial for all filesystems.
For XFS it does not seem to make any difference, but on BTRFS it seems
to cause a regression.

I think this means we may need a "--prefetch" option, that'd force
prefetching, probably both before pread and copy_file_range. Otherwise
people on ZFS are doomed and will have poor performance.


3) bulk operations
------------------

Another thing suggested by Thomas last week was that maybe we should try
detecting longer runs of blocks coming from the same file, and operate
on them as a single chunk of data. If you see e.g. 32 blocks, instead of
doing read/write or copy_file_range for each of them, we could simply do
one call for all those blocks at once.

I think this is pretty likely to help, especially for small incremental
backups where most of the blocks will come from the full backup. And I
suspected the XFS regression (where copy_file_range was up to 30-50%
slower in some cases, see [1]) is related to this, because the perf
profiles had stuff like this:

  97.28%   2.10%  pg_combinebacku  [kernel.vmlinux]  [k]
    |
    |--95.18%--entry_SYSCALL_64
    |    |
    |     --94.99%--do_syscall_64
    |       |
    |       |--74.13%--__do_sys_copy_file_range
    |       |    |
    |       |     --73.72%--vfs_copy_file_range
    |       |       |
    |       |        --73.14%--xfs_file_remap_range
    |       |            |
    |       |            |--70.65%--xfs_reflink_remap_blocks
    |       |            |    |
    |       |            |     --69.86%--xfs_reflink_remap_extent

So I took a stab at this in 0007, which detects runs of blocks coming
from the same source file (limited to 128 blocks, i.e. 1MB). I only did
this for the copy_file_range() calls in 0007, and the results for XFS
look like this (complete results are in the PDF):

           old (block-by-block)        new (batches)
  ------------------------------------------------------
    1%          150s                       4s
   10%        150-200s                    46s
   20%        150-200s                    65s

Yes, once again, those results are real, the backups are valid etc. So
not only does it take much less space (thanks to block alignment), it
also takes much less time (thanks to bulk operations).

The cases with "manifest=on" improve too, but not nearly this much. I
believe this is simply because the read/write still happens block by
block. But it shouldn't be difficult to do in a bulk manner too (we
already have the range detected, but I was lazy).
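
The batching itself is conceptually just run detection plus one bigger
call, roughly like this (a simplified sketch, ignoring partial copies
and the cap bookkeeping):

    /*
     * Find a run of consecutive blocks coming from the same source
     * file at adjacent offsets, then copy the run in a single call.
     */
    unsigned int nblocks = 1;
    off_t        off;

    while (i + nblocks < block_length &&
           nblocks < 128 &&     /* limit runs to 1MB */
           sourcemap[i + nblocks] == sourcemap[i] &&
           offsetmap[i + nblocks] == offsetmap[i] + nblocks * BLCKSZ)
        nblocks++;

    off = offsetmap[i];
    if (copy_file_range(s->fd, &off, wfd, NULL,
                        nblocks * BLCKSZ, 0) != nblocks * BLCKSZ)
        pg_fatal("error while copying file range: %m");

    i += nblocks;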



[1]
https://www.postgresql.org/message-id/0e27835d-dab5-49cd-a3ea-52cf6d9ef59e%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: pg_combinebackup --copy-file-range

From
Thomas Munro
Date:
On Tue, Apr 2, 2024 at 8:43 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> And I think he's right, and my tests confirm this. I did a trivial patch
> to align the blocks to 8K boundary, by forcing the header to be a
> multiple of 8K (I think 4K alignment would be enough). See the 0001
> patch that does this.
>
> And if I measure the disk space used by pg_combinebackup, and compare
> the results with results without the patch ([1] from a couple days
> back), I see this:
>
>    pct      not aligned        aligned
>   -------------------------------------
>     1%             689M            19M
>    10%            3172M            22M
>    20%           13797M            27M
>
> Yes, those numbers are correct. I didn't believe this at first, but the
> backups are valid/verified, checksums are OK, etc. BTRFS has similar
> numbers (e.g. drop from 20GB to 600MB).

Fantastic.

> I think we absolutely need to align the blocks in the incremental files,
> and I think we should do that now. I think 8K would work, but maybe we
> should add alignment parameter to basebackup & manifest?
>
> The reason why I think maybe this should be a basebackup parameter is
> the recent discussion about large fs blocks - it seems to be in the
> works, so maybe better to be ready and not assume all fs have 4K.
>
> And I think we probably want to do this now, because this affects all
> tools dealing with incremental backups - even if someone writes a custom
> version of pg_combinebackup, it will have to deal with misaligned data.
> Perhaps there might be something like pg_basebackup that "transforms"
> the data received from the server (and also the backup manifest), but
> that does not seem like a great direction.

+1, and I think BLCKSZ is the right choice.

> I was very puzzled by the awful performance on ZFS. When every other fs
> (EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took
> 900-1000 seconds on ZFS, no matter what I did. I tried all the tuning
> advice I could think of, with almost no effect.
>
> Ultimately I decided that it probably is the "no readahead" behavior
> I've observed on ZFS. I assume it's because it doesn't use the page
> cache where the regular readahead is detected etc. And there's no
> prefetching in pg_combinebackup, so I decided to do an experiment and added
> a trivial explicit prefetch when reconstructing the file - every time
> we'd read data from a file, we do posix_fadvise for up to 128 blocks
> ahead (similar to what bitmap heap scan code does). See 0002.
>
> And tadaaa - the duration dropped from 900-1000 seconds to only about
> 250-300 seconds, so an improvement of a factor of 3-4x. I think this is
> pretty massive.

Interesting.  ZFS certainly has its own prefetching heuristics with
lots of logic and settings, but it could be that it's using
strict-next-block detection of access pattern (ie what I called
effective_io_readahead_window=0 in the streaming I/O thread) instead
of a window (ie like the Linux block device level read ahead where,
AFAIK, if you access anything in that sliding window it is triggered),
and perhaps your test has a lot of non-contiguous but close-enough
blocks?  (Also reminds me of the similar discussion on the BHS thread
about distinguishing sequential access from
mostly-sequential-but-with-lots-of-holes-like-Swiss-cheese, and the
fine line between them.)

You could double-check this and related settings (for example I think
it might disable itself automatically if you're on a VM with small RAM
size):

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-prefetch-disable

> There's a couple more interesting ZFS details - the prefetching seems to
> be necessary even when using copy_file_range() and we don't need to read
> the data (to calculate checksums). This is why the "manifest=off" chart
> has the strange group of high bars at the end - the copy cases are fast
> because prefetch happens, but if we switch to copy_file_range() there
> are no prefetches and it gets slow.

Hmm, at a guess, it might be due to prefetching the dnode (root object
for a file) and block pointers, ie the structure but not the data
itself.

> This is a bit bizarre, especially because the manifest=on cases are
> still fast, exactly because the pread + prefetching still happens. I'm
> sure users would find this puzzling.
>
> Unfortunately, the prefetching is not beneficial for all filesystems.
> For XFS it does not seem to make any difference, but on BTRFS it seems
> to cause a regression.
>
> I think this means we may need a "--prefetch" option, that'd force
> prefetching, probably both before pread and copy_file_range. Otherwise
> people on ZFS are doomed and will have poor performance.

Seems reasonable if you can't fix it by tuning ZFS.  (Might also be an
interesting research topic for a potential ZFS patch:
prefetch_swiss_cheese_window_size.  I will not be nerd-sniped into
reading the relevant source today, but I'll figure it out soonish...)

> So I took a stab at this in 0007, which detects runs of blocks coming
> from the same source file (limited to 128 blocks, i.e. 1MB). I only did
> this for the copy_file_range() calls in 0007, and the results for XFS
> look like this (complete results are in the PDF):
>
>            old (block-by-block)        new (batches)
>   ------------------------------------------------------
>     1%          150s                       4s
>    10%        150-200s                    46s
>    20%        150-200s                    65s
>
> Yes, once again, those results are real, the backups are valid etc. So
> not only it takes much less space (thanks to block alignment), it also
> takes much less time (thanks to bulk operations).

Again, fantastic.



Re: pg_combinebackup --copy-file-range

From
Tomas Vondra
Date:
On 4/1/24 23:45, Thomas Munro wrote:
> ...
>>
>> I was very puzzled by the awful performance on ZFS. When every other fs
>> (EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took
>> 900-1000 seconds on ZFS, no matter what I did. I tried all the tuning
>> advice I could think of, with almost no effect.
>>
>> Ultimately I decided that it probably is the "no readahead" behavior
>> I've observed on ZFS. I assume it's because it doesn't use the page
>> cache where the regular readahead is detected etc. And there's no
>> prefetching in pg_combinebackup, so I decided to do an experiment and added
>> a trivial explicit prefetch when reconstructing the file - every time
>> we'd read data from a file, we do posix_fadvise for up to 128 blocks
>> ahead (similar to what bitmap heap scan code does). See 0002.
>>
>> And tadaaa - the duration dropped from 900-1000 seconds to only about
>> 250-300 seconds, so an improvement of a factor of 3-4x. I think this is
>> pretty massive.
> 
> Interesting.  ZFS certainly has its own prefetching heuristics with
> lots of logic and settings, but it could be that it's using
> strict-next-block detection of access pattern (ie what I called
> effective_io_readahead_window=0 in the streaming I/O thread) instead
> of a window (ie like the Linux block device level read ahead where,
> AFAIK, if you access anything in that sliding window it is triggered),
> and perhaps your test has a lot of non-contiguous but close-enough
> blocks?  (Also reminds me of the similar discussion on the BHS thread
> about distinguishing sequential access from
> mostly-sequential-but-with-lots-of-holes-like-Swiss-cheese, and the
> fine line between them.)
> 

I don't think the files have a lot of non-contiguous but close-enough
blocks (it's rather that we'd skip blocks that need to come from a later
incremental file). The backups are generated to have a certain fraction
of modified blocks.

For example, the smallest backup has 1% modified blocks, which means 99%
of the blocks come from the base backup and 1% comes from the increment.
And indeed, the whole database is ~75GB and the backup is ~740MB. Which
means that on average there will be runs of 99 blocks from the base
backup, then a skip of 1 block (to come from the increment), and then
again 99-1-99-1. So it's very sequential, almost no holes, and the
increment is 100% sequential. And it still does not seem to prefetch
anything.


> You could double-check this and related settings (for example I think
> it might disable itself automatically if you're on a VM with small RAM
> size):
> 
> https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-prefetch-disable
> 

I haven't touched that parameter at all, and it's "enabled" by default:

# cat /sys/module/zfs/parameters/zfs_prefetch_disable
0

While trying to make the built-in prefetch work I reviewed the other
parameters with the "prefetch" tag, without success. And I haven't seen
any advice on how to make it work ...

>> There's a couple more interesting ZFS details - the prefetching seems to
>> be necessary even when using copy_file_range() and we don't need to read
>> the data (to calculate checksums). This is why the "manifest=off" chart
>> has the strange group of high bars at the end - the copy cases are fast
>> because prefetch happens, but if we switch to copy_file_range() there
>> are no prefetches and it gets slow.
> 
> Hmm, at a guess, it might be due to prefetching the dnode (root object
> for a file) and block pointers, ie the structure but not the data
> itself.
> 

Yeah, that's possible. But the effects are the same - it doesn't matter
what exactly is not prefetched. But perhaps we could prefetch just a
tiny part of the record, enough to prefetch the dnode+pointers, not the
whole record. Might save some space in ARC, perhaps?

>> This is a bit bizarre, especially because the manifest=on cases are
>> still fast, exactly because the pread + prefetching still happens. I'm
>> sure users would find this puzzling.
>>
>> Unfortunately, the prefetching is not beneficial for all filesystems.
>> For XFS it does not seem to make any difference, but on BTRFS it seems
>> to cause a regression.
>>
>> I think this means we may need a "--prefetch" option, that'd force
>> prefetching, probably both before pread and copy_file_range. Otherwise
>> people on ZFS are doomed and will have poor performance.
> 
> Seems reasonable if you can't fix it by tuning ZFS.  (Might also be an
> interesting research topic for a potential ZFS patch:
> prefetch_swiss_cheese_window_size.  I will not be nerd-sniped into
> reading the relevant source today, but I'll figure it out soonish...)
> 

It's entirely possible I'm just too stupid and it works just fine for
everyone else. But maybe not, and I'd say an implementation that is this
difficult to configure is almost as if it didn't exist at all. The
Linux read-ahead works pretty great by default.

So I don't see how to make this work without explicit prefetch ... Of
course, we could also do no prefetch and tell users it's up to ZFS to
make this work, but I don't think it does them any service.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_combinebackup --copy-file-range

From
Jakub Wartak
Date:
On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> I've been running some benchmarks and experimenting with various stuff,
> trying to improve the poor performance on ZFS, and the regression on XFS
> when using copy_file_range. And oh boy, did I find interesting stuff ...

[..]

Congratulations on great results!

> 4) after each pg_combinebackup, run pg_verifybackup, start the cluster
> to finish recovery, and run pg_checksums --check (to check the patches
> don't produce something broken)

I've performed some small follow-up testing on all the patches mentioned
here (1..7), with the previously developed nano-incremental-backup tests
that helped detect some issues for Robert during the original
development. They all went fine in both cases:
- no special options when using pg_combinebackup
- using pg_combinebackup --copy-file-range --manifest-checksums=NONE

Those were:
test_across_wallevelminimal.sh
test_full_pri__incr_stby__restore_on_pri.sh
test_full_pri__incr_stby__restore_on_stby.sh
test_full_stby__incr_stby__restore_on_pri.sh
test_full_stby__incr_stby__restore_on_stby.sh
test_incr_after_timelineincrease.sh
test_incr_on_standby_after_promote.sh
test_many_incrementals_dbcreate_duplicateOID.sh
test_many_incrementals_dbcreate_filecopy_NOINCR.sh
test_many_incrementals_dbcreate_filecopy.sh
test_many_incrementals_dbcreate.sh
test_many_incrementals.sh
test_multixact.sh
test_pending_2pc.sh
test_reindex_and_vacuum_full.sh
test_repro_assert_RP.sh
test_repro_assert.sh
test_standby_incr_just_backup.sh
test_stuck_walsum.sh
test_truncaterollback.sh
test_unlogged_table.sh

> Now to the findings ....
>
>
> 1) block alignment

[..]

> And I think we probably want to do this now, because this affects all
> tools dealing with incremental backups - even if someone writes a custom
> version of pg_combinebackup, it will have to deal with misaligned data.
> Perhaps there might be something like pg_basebackup that "transforms"
> the data received from the server (and also the backup manifest), but
> that does not seem like a great direction.

If anything is on the table, then I think in the far future a
pg_refresh_standby_using_incremental_backup_from_primary would be the
only other tool using the format?

> 2) prefetch
> -----------
[..]
> I think this means we may need a "--prefetch" option, that'd force
> prefetching, probably both before pread and copy_file_range. Otherwise
> people on ZFS are doomed and will have poor performance.

Right, we could optionally cover the various options to get the
performance in the docs later on ("on XFS use $this, but without $that",
and so on). It's kind of madness dealing with all those performance
variations.

Another idea: remove the hardcoded 128 in the posix_fadvise() logic in
0002 and add a getopt variant like --prefetch[=HOWMANY], with 128 as the
default?

-J.



Re: pg_combinebackup --copy-file-range

From
Tomas Vondra
Date:
Hi,

Here's a much more polished and cleaned up version of the patches,
fixing all the issues I've been aware of, and with various parts merged
into much more cohesive parts (instead of keeping them separate to make
the changes/evolution more obvious).

I decided to reorder the changes like this:

1) block alignment - As I said earlier, I think we should definitely do
this, even if only to make future improvements possible. After chatting
about this with Robert off-list a bit, he pointed out I actually forgot
to not align the headers for files without any blocks, so this version
fixes that.

2) add clone/copy_file_range for the case that copies whole files. This
is pretty simple, but it also adds the --clone/copy-file-range options,
and similar infrastructure. The one slightly annoying bit is that we now
have the ifdef stuff in two places - when parsing the options, and then
in the copy_file_XXX methods, and the pg_fatal() calls should not be
reachable in practice. But that doesn't seem harmful, and might be a
useful protection against someone calling a function that does nothing.

This also merges the original two parts, where the first one only did
this cloning/CoW stuff when checksum did not need to be calculated, and
the second extended it to those cases too (by also reading the data, but
still doing the copy the old way).

I think this is the right way - if that's not desirable, it's easy to
either add --no-manifest or not use the CoW options. Or just not change
the checksum type. There's no other way.

3) add copy_file_range to write_reconstructed_file, by using roughly the
same logic/reasoning as (2). If --copy-file-range is specified and a
checksum should be calculated, the data is read for the checksum, but
the copy is done using copy_file_range.

I did rework the flow write_reconstructed_file() flow a bit, because
tracking what exactly needs to be read/written in each of the cases
(which copy method, zeroed block, checksum calculated) made the flow
really difficult to follow. Instead I introduced functions to
read/write a block, and call them from different places.

I think this is much better, and it also makes the following part
dealing with batches of blocks a much easier / smaller change.
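
In sketch form, the reworked flow boils down to two helpers of roughly
this shape (simplified, with a placeholder checksum and no error handling;
the names are mine, not necessarily the patch's):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /* Read one block and feed it to the (placeholder) checksum. */
    static ssize_t
    read_block(int fd, off_t off, uint8_t *buf, uint32_t *checksum)
    {
        ssize_t     nread = pread(fd, buf, BLCKSZ, off);

        if (checksum && nread > 0)
            for (ssize_t i = 0; i < nread; i++)
                *checksum += buf[i];    /* stand-in for the real CRC */

        return nread;
    }

    /*
     * Write one block to the output: either the buffer we already read
     * (plain copy), or a kernel-side copy via copy_file_range() -- in
     * which case the read above served only the checksum calculation.
     */
    static ssize_t
    write_block(int wfd, int rfd, off_t roff, const uint8_t *buf,
                bool use_copy_file_range)
    {
        if (use_copy_file_range)
        {
            off_t       src_off = roff;

            return copy_file_range(rfd, &src_off, wfd, NULL, BLCKSZ, 0);
        }

        return write(wfd, buf, BLCKSZ);
    }

The zeroed-block case then becomes just another caller of write_block()
with a shared all-zeros buffer.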

4) prefetching - This is mostly unrelated to the CoW stuff, but it has
tremendous benefits, especially for ZFS. I haven't been able to tune ZFS
to get decent performance, and ISTM it'd be rather unusable for backup
purposes without this.

5) batching in write_reconstructed_file - This benefits especially the
copy_file_range case, where I've seen it yield massive speedups (see
the message from Monday for better data, and the sketch after this list).

6) batching for prefetch - Similar logic to (5), but for fadvise. This
is the only part where I'm not entirely sure whether it actually helps
or not. Needs some more analysis, but I'm including it for completeness.
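
To make (5) more concrete, the batching idea is essentially this (a
simplified sketch with made-up structures; the real patch also caps the
batch size at ~1MB and handles errors/short copies):

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /*
     * sourcefd[i] / sourceoff[i] say where output block i comes from.
     * Runs of blocks that are contiguous in the same source file are
     * merged into a single copy_file_range() call, instead of one call
     * per 8kB block.
     */
    static void
    copy_blocks_batched(int wfd, const int *sourcefd, const off_t *sourceoff,
                        unsigned nblocks)
    {
        unsigned    i = 0;

        while (i < nblocks)
        {
            unsigned    j = i + 1;
            off_t       off = sourceoff[i];

            while (j < nblocks &&
                   sourcefd[j] == sourcefd[i] &&
                   sourceoff[j] == sourceoff[i] + (off_t) (j - i) * BLCKSZ)
                j++;

            /* NULL output offset: wfd's own offset advances sequentially */
            copy_file_range(sourcefd[i], &off, wfd, NULL,
                            (size_t) (j - i) * BLCKSZ, 0);

            i = j;
        }
    }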


I do think the parts (1)-(4) are in pretty good shape, good enough to
make them committable in a day or two. I see it mostly as a matter of
testing and comment improvements rather than code changes.

(5) is in pretty good shape too, but I'd like to maybe simplify and
refine the write_reconstructed_file changes a bit more. I don't think
it's broken, but it feels a bit too cumbersome.

Not sure about (6) yet.

I changed how I think about this a bit - I don't really see the CoW copy
methods as necessarily faster than the regular copy (even though they can
be with (5)). I think the main benefits are the space savings, enabled by
patches (1)-(3). If (4) and (5) get in, that's a bonus, but even without
that I don't think the performance is an issue - everything has a cost.


On 4/3/24 15:39, Jakub Wartak wrote:
> On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Hi,
>>
>> I've been running some benchmarks and experimenting with various stuff,
>> trying to improve the poor performance on ZFS, and the regression on XFS
>> when using copy_file_range. And oh boy, did I find interesting stuff ...
> 
> [..]
> 
> Congratulations on great results!
> 
>> 4) after each pg_combinebackup, run pg_verifybackup, start the cluster
>> to finish recovery, and run pg_checksums --check (to check the patches
>> don't produce something broken)
> 
> I've performed some small follow-up testing on all patches mentioned
> here (1..7), with the previously developed nano-incremental-backup-tests
> that helped detect some issues for Robert during the original
> development. They all went fine in both cases:
> - no special options when using pg_combinebackup
> - using pg_combinebackup --copy-file-range --manifest-checksums=NONE
> 
> Those were:
> test_across_wallevelminimal.sh
> test_full_pri__incr_stby__restore_on_pri.sh
> test_full_pri__incr_stby__restore_on_stby.sh
> test_full_stby__incr_stby__restore_on_pri.sh
> test_full_stby__incr_stby__restore_on_stby.sh
> test_incr_after_timelineincrease.sh
> test_incr_on_standby_after_promote.sh
> test_many_incrementals_dbcreate_duplicateOID.sh
> test_many_incrementals_dbcreate_filecopy_NOINCR.sh
> test_many_incrementals_dbcreate_filecopy.sh
> test_many_incrementals_dbcreate.sh
> test_many_incrementals.sh
> test_multixact.sh
> test_pending_2pc.sh
> test_reindex_and_vacuum_full.sh
> test_repro_assert_RP.sh
> test_repro_assert.sh
> test_standby_incr_just_backup.sh
> test_stuck_walsum.sh
> test_truncaterollback.sh
> test_unlogged_table.sh
> 
>> Now to the findings ....
>>

Thanks. Would be great if you could run this on the attached version of
the patches, ideally for each of them independently, to make sure it
doesn't get broken+fixed somewhere along the way.

>>
>> 1) block alignment
> 
> [..]
> 
>> And I think we probably want to do this now, because this affects all
>> tools dealing with incremental backups - even if someone writes a custom
>> version of pg_combinebackup, it will have to deal with misaligned data.
>> Perhaps there might be something like pg_basebackup that "transforms"
>> the data received from the server (and also the backup manifest), but
>> that does not seem like a great direction.
> 
> If anything is on the table, then I think in the far future
> pg_refresh_standby_using_incremental_backup_from_primary would be the
> only other tool using the format?
> 

Possibly, but I was thinking more about backup solutions using the same
format, but doing the client side differently. Essentially, something
that would still use the server side to generate incremental backups,
but replace pg_combinebackup to do this differently (stream the data
somewhere else, index it somehow, or whatever).

>> 2) prefetch
>> -----------
> [..]
>> I think this means we may need a "--prefetch" option, that'd force
>> prefetching, probably both before pread and copy_file_range. Otherwise
>> people on ZFS are doomed and will have poor performance.
> 
> Right, we could optionally cover in the docs later on various options
> to get good performance (on XFS use $this, but without $that and so
> on). It's kind of madness dealing with all those performance
> variations.
> 

Yeah, something like that. I'm not sure we want to talk too much about
individual filesystems in our docs, because those things evolve over
time too. And also this depends on how large the increment is. If it
only modifies 1% of the blocks, then 99% will come from the full backup,
and the sequential prefetch should do OK (well, maybe not on ZFS). But
as the incremental backup gets larger / is more random, the prefetch is
more and more important.

> Another idea: remove that hardcoded 128 in the posix_fadvise() call in
> 0002 and add a getopt variant like --prefetch[=HOWMANY], with 128 as
> the default?
> 

I did think about that, but there's a dependency on the batching. If
we're copying ~1MB batches of data, we may need to prefetch up to ~1MB
ahead, because otherwise we might want to read 1MB and only a tiny part
of that would be prefetched. I was thinking maybe we could skip the
sequential parts, but ZFS does need that.

So I don't think we can just allow users to set arbitrary values, at
least not without also tweaking the batch. Or maybe 1MB batches are too
large, and we should use something smaller? I need to think about this a
bit more ...
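
For the record, the interdependency looks roughly like this (a standalone
sketch with hypothetical constants): the prefetch distance has to stay
ahead of a whole batch, which is why the two knobs can't be tuned
independently:

    #include <fcntl.h>

    #define BLCKSZ          8192
    #define BATCH_BLOCKS    128             /* ~1MB batch */
    #define PREFETCH_TARGET BATCH_BLOCKS    /* must cover a full batch */

    /*
     * Before reading block `next`, make sure everything up to
     * next + PREFETCH_TARGET has been posted via posix_fadvise().
     * If PREFETCH_TARGET were smaller than the batch, most of a 1MB
     * batched read would arrive unprefetched.
     */
    static void
    prefetch_ahead(int fd, const off_t *blockoff, unsigned nblocks,
                   unsigned next, unsigned *prefetched)
    {
        unsigned    target = next + PREFETCH_TARGET;

        if (target > nblocks)
            target = nblocks;

        for (; *prefetched < target; (*prefetched)++)
            (void) posix_fadvise(fd, blockoff[*prefetched], BLCKSZ,
                                 POSIX_FADV_WILLNEED);
    }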


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Вложения

Re: pg_combinebackup --copy-file-range

От
Jakub Wartak
Дата:
On Thu, Apr 4, 2024 at 12:56 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> Here's a much more polished and cleaned up version of the patches,
> fixing all the issues I've been aware of, and with various parts merged
> into much more cohesive parts (instead of keeping them separate to make
> the changes/evolution more obvious).

OK, so three runs of the incrementalbackupstests - as stated earlier -
also passed OK for v20240403 (this time even with
--enable-casserts)

pg_combinebackup flags tested were:
1) --copy-file-range --manifest-checksums=CRC32C
2) --copy-file-range --manifest-checksums=NONE
3) default, no flags (no copy-file-range)

> I changed how I think about this a bit - I don't really see the CoW copy
> methods as necessarily faster than the regular copy (even though they can
> be with (5)). I think the main benefits are the space savings, enabled by
> patches (1)-(3). If (4) and (5) get in, that's a bonus, but even without
> that I don't think the performance is an issue - everything has a cost.

I take it differently: incremental backups without a CoW fs would clearly be:
- inefficient in terms of RTO (SANs are often a key thing due to
the fast ability to "restore" a clone rather than copying the
data from somewhere else)
- pg_basebackup would be unusable without the space savings
(e.g. imagine daily backups @ 10+TB DWHs)

> On 4/3/24 15:39, Jakub Wartak wrote:
> > On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
[..]
> Thanks. Would be great if you could run this on the attached version of
> the patches, ideally for each of them independently, to make sure it
> doesn't get broken+fixed somewhere along the way.

Those are semi-manual test runs (~30 min? per run), and the above results
are for all of them applied at once. So my take is that if all of them
work together, each one works individually too.

FWIW, I'm also testing your other offlist incremental backup
corruption issue, but that doesn't seem to be related in any way to the
copy_file_range() patches here.

> >> 2) prefetch
> >> -----------
> > [..]
> >> I think this means we may need a "--prefetch" option, that'd force
> >> prefetching, probably both before pread and copy_file_range. Otherwise
> >> people on ZFS are doomed and will have poor performance.
> >
> > Right, we could optionally cover in the docs later on various options
> > to get good performance (on XFS use $this, but without $that and so
> > on). It's kind of madness dealing with all those performance
> > variations.
> >
>
> Yeah, something like that. I'm not sure we want to talk too much about
> individual filesystems in our docs, because those things evolve over
> time too.

Sounds like Wiki then.

BTW, after a quick review: could we have some common value in 0005 (to
keep those together via some .h)? Something like:

#define        BATCH_SIZE PREFETCH_TARGET
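
i.e. a tiny shared header along these lines (the file name and the
PREFETCH_TARGET value are just assumptions):

    /* reconstruct_tuning.h (hypothetical) */
    #define PREFETCH_TARGET 128
    #define BATCH_SIZE      PREFETCH_TARGET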

-J.



Re: pg_combinebackup --copy-file-range

От
Tomas Vondra
Дата:
On 4/4/24 12:25, Jakub Wartak wrote:
> On Thu, Apr 4, 2024 at 12:56 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Hi,
>>
>> Here's a much more polished and cleaned up version of the patches,
>> fixing all the issues I've been aware of, and with various parts merged
>> into much more cohesive parts (instead of keeping them separate to make
>> the changes/evolution more obvious).
> 
> OK, so three runs of the incrementalbackupstests - as stated earlier -
> also passed OK for v20240403 (this time even with
> --enable-casserts)
> 
> pg_combinebackup flags tested were:
> 1) --copy-file-range --manifest-checksums=CRC32C
> 2) --copy-file-range --manifest-checksums=NONE
> 3) default, no flags (no copy-file-range)
> 

Thanks!

>> I changed how I think about this a bit - I don't really see the CoW copy
>> methods as necessarily faster than the regular copy (even though they can
>> be with (5)). I think the main benefits are the space savings, enabled by
>> patches (1)-(3). If (4) and (5) get in, that's a bonus, but even without
>> that I don't think the performance is an issue - everything has a cost.
> 
> I take it differently: incremental backups without a CoW fs would clearly be:
> - inefficient in terms of RTO (SANs are often a key thing due to
> the fast ability to "restore" a clone rather than copying the
> data from somewhere else)
> - pg_basebackup would be unusable without the space savings
> (e.g. imagine daily backups @ 10+TB DWHs)
> 

Right, although this very much depends on the backup scheme. If you only
take incremental backups, and then also a full backup once in a while,
the CoW stuff probably does not help much. The alignment (the only thing
affecting basebackups) may allow deduplication, but that's all, I think.

If the scheme is more complex, and involves "merging" the increments
into the full backup, then this does help a lot. It'd even be possible
to cheaply clone instances this way, I think. But I'm not sure how often
people would do that on the same volume, to benefit from the CoW.

>> On 4/3/24 15:39, Jakub Wartak wrote:
>>> On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
> [..]
>> Thanks. Would be great if you could run this on the attached version of
>> the patches, ideally for each of them independently, to make sure it
>> doesn't get broken+fixed somewhere along the way.
> 
> Those are semi-manual test runs (~30 min? per run), and the above results
> are for all of them applied at once. So my take is that if all of them
> work together, each one works individually too.
> 

Cool, thanks.

> FWIW, I'm also testing your other offlist incremental backup
> corruption issue, but that doesn't seem to be related in any way to the
> copy_file_range() patches here.
> 

Yes, that's entirely independent, happens with master too.

>>>> 2) prefetch
>>>> -----------
>>> [..]
>>>> I think this means we may need a "--prefetch" option, that'd force
>>>> prefetching, probably both before pread and copy_file_range. Otherwise
>>>> people on ZFS are doomed and will have poor performance.
>>>
>>> Right, we could optionally cover in the docs later on various options
>>> to get good performance (on XFS use $this, but without $that and so
>>> on). It's kind of madness dealing with all those performance
>>> variations.
>>>
>>
>> Yeah, something like that. I'm not sure we want to talk too much about
>> individual filesystems in our docs, because those things evolve over
>> time too.
> 
> Sounds like Wiki then.
> 
> BTW, after a quick review: could we have some common value in 0005 (to
> keep those together via some .h)? Something like:
> 
> #define        BATCH_SIZE PREFETCH_TARGET
> 

Yes, that's one of the things I'd like to refine a bit more. Making it
more consistent / clearer that these things are interdependent.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_combinebackup --copy-file-range

От
Tomas Vondra
Дата:
Hi,

I have pushed the three patches of this series - the one that aligns
blocks, and the two adding clone/copy_file_range to pg_combinebackup.
The committed versions are pretty much the 2024/04/03 version, with
various minor cleanups (e.g. I noticed the docs were still claiming the
copy methods work only without checksum calculations, but that's no
longer true). I also changed the parameter order to keep the dry_run and
debug parameters last; it seems nicer this way.

The buildfarm reported two compile-time problems, both of them entirely
avoidable (reported by cfbot but I failed to notice that). Should have
known better ...

Anyway, with these patches committed, pg_combinebackup can use CoW stuff
to combine backups cheaply (especially in disk-space terms).

The first patch (block alignment), however, turned out to be important
even for non-CoW filesystems, in some cases. I did a lot of benchmarks
with the standard block-by-block copying of data, and on a machine with
SSD RAID storage the duration went from ~400 seconds for some runs to
only about 150 seconds (with aligned blocks). My explanation is that
with the misaligned blocks the RAID often has to access two devices to
read a block, and the alignment makes that go away.
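
A quick way to see the mechanism (a toy standalone illustration; the 64kB
chunk size is just an assumption, not the benchmarked array's): an 8kB
block stored at an arbitrary offset can straddle a RAID chunk boundary,
while an 8kB-aligned one never does, as long as the chunk size is a
multiple of 8kB:

    #include <stdbool.h>
    #include <stdio.h>

    #define BLCKSZ     8192
    #define RAID_CHUNK (64 * 1024)      /* assumed chunk size */

    static bool
    straddles_chunk(long off)
    {
        return off / RAID_CHUNK != (off + BLCKSZ - 1) / RAID_CHUNK;
    }

    int
    main(void)
    {
        /* aligned: block starts at a multiple of 8kB, stays in one chunk */
        printf("%d\n", straddles_chunk(10 * BLCKSZ));       /* prints 0 */

        /* misaligned: a block starting near a chunk boundary spans two */
        printf("%d\n", straddles_chunk(RAID_CHUNK - 100));  /* prints 1 */

        return 0;
    }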

In the attached PDF with results (duration.pdf), showing the duration of
pg_combinebackup on an increment of a particular size (1%, 10% or 20%),
this is visible as a green square on the right. Those columns are
results relative to a baseline - which for "copy" is master before the
block alignment patch, and for "copy_file_range" it's the 3-reconstruct patch
(adding copy_file_range to combining blocks from increments).

FWIW the last three columns are a comparison with prefetching enabled.

There are a couple of interesting observations from this, based on which
I'm not going to try to get the remaining patches (batching and
prefetching) into v17. This clearly needs more analysis to make the right
tradeoffs.

From the table, I think it's clear that:

0) Block alignment has a big impact on RAID storage, even with regular copy.

1) The batching (original patch 0005) either does not help the regular
copy, or it actually makes it slower. The PDF is a bit misleading
because it seems to suggest the i5 machine is unaffected, while the xeon
gets ~30% slower. But that's just an illusion - the comparison is to
master, but the alignment patch made i5 about 2x faster. So it's 200%
slower when compared to "current master" with the alignment patch.

That's not great :-/ And also a bit strange - I would have expected the
batching to help the simple copy too. I haven't looked into why this
happens, so there's a chance I made some silly mistake, who knows.

For the copy_file_range case the batching is usually very beneficial,
sometimes reducing the duration to a fraction of the non-batched case.

My interpretation is that (unless there's a bug in the patch) we may
need two variants of that code - a non-batched one for regular copy, and
a batched variant for copy_file_range.

2) The prefetching is not a huge improvement, at least not for these
three filesystems (btrfs, ext4, xfs). From the color scale it might seem
like it helps, but those values are relative to the baseline, so when
the non-prefetching value is 5% and with prefetching 10%, that means the
prefetching makes it slower. And that's very often true.

This is visible more clearly in prefetching.pdf, comparing the
non-prefetching and prefetching results for each patch, not to baseline.
That makes it quite clear there's a lot of "red" where prefetching
makes it slower. It certainly does help for larger increments (which
makes sense, because the modified blocks are distributed randomly, and
thus come from random files, making long streaks unlikely).

I've imagined the prefetching could be made a bit smarter to ignore the
streaks (=sequential patterns), but once again - this only matters with
the batching, which we don't have. And without the batching it looks
like a net loss (that's the first column in the prefetching PDF).

I did start thinking about prefetching because of ZFS, where it was
necessary to get decent performance. And that's still true. But (a) even
with the explicit prefetching it's still 2-3x slower than any of these
filesystems, so I assume performance-sensitive use cases won't use it.
And (b) the prefetching seems necessary in all cases, no matter how
large the increment is. Which goes directly against the idea of looking
at how random the blocks are and prefetching only the sufficiently
random patterns. That doesn't seem like a great thing.

3) There's also the question of disk space usage. The size.pdf shows how
the patches affect space needed for the pg_combinebackup result. It does
depend a bit on the internal fs cleanup for each run, but it seems the
batching makes a difference - clearly copying 1MB blocks instead of 8kB
allows lower overhead for some filesystems (e.g. btrfs, where we get
from ~1.5GB to a couple MBs). But the space savings are quite negligible
compared to just using the --copy-file-range option (where we get from 75GB
to 1.5GB). I think the batching is interesting mostly because of the
substantial duration reduction.

I'm also attaching the benchmarking script I used (warning: ugly!), and
results for the three filesystems. For ZFS I only have partial results
so far, because it's so slow, but in general - without prefetching it's
slow (~1000s); with prefetching it's better, but still slow (~250s).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Вложения

Re: pg_combinebackup --copy-file-range

От
Tomas Vondra
Дата:
On 4/5/24 21:43, Tomas Vondra wrote:
> Hi,
> 
> ...
> 
> 2) The prefetching is not a huge improvement, at least not for these
> three filesystems (btrfs, ext4, xfs). From the color scale it might seem
> like it helps, but those values are relative to the baseline, so when
> the non-prefetching value is 5% and with prefetching 10%, that means the
> prefetching makes it slower. And that's very often true.
> 
> This is visible more clearly in prefetching.pdf, comparing the
> non-prefetching and prefetching results for each patch, not to baseline.
> That makes it quite clear there's a lot of "red" where prefetching
> makes it slower. It certainly does help for larger increments (which
> makes sense, because the modified blocks are distributed randomly, and
> thus come from random files, making long streaks unlikely).
> 
> I've imagined the prefetching could be made a bit smarter to ignore the
> streaks (=sequential patterns), but once again - this only matters with
> the batching, which we don't have. And without the batching it looks
> like a net loss (that's the first column in the prefetching PDF).
> 
> I did start thinking about prefetching because of ZFS, where it was
> necessary to get decent performance. And that's still true. But (a) even
> with the explicit prefetching it's still 2-3x slower than any of these
> filesystems, so I assume performance-sensitive use cases won't use it.
> And (b) the prefetching seems necessary in all cases, no matter how
> large the increment is. Which goes directly against the idea of looking
> at how random the blocks are and prefetching only the sufficiently
> random patterns. That doesn't seem like a great thing.
> 

I finally got more complete ZFS results, and I also decided to get
some numbers without the ZFS tuning I did. And boy oh boy ...

All the tests I did with ZFS were tuned the way I've seen recommended
when using ZFS for PostgreSQL, that is

  zfs set recordsize=8K logbias=throughput compression=none

and this performed quite poorly - pg_combinebackup took 4-8x longer than
with the traditional filesystems (btrfs, xfs, ext4) and the only thing
that improved that measurably was prefetching.

But once I reverted back to the default recordsize of 128kB the
performance is waaaaaay better - entirely comparable to ext4/xfs, while
btrfs remains faster with --copy-file-range --no-manifest (by a factor
of 2-3x).

This is quite clearly visible in the attached "current.pdf" which shows
results for the current master (i.e. filtered to the 3-reconstruct patch
adding CoW stuff to write_reconstructed_file).

There are also some differences in the disk usage, where ZFS seems to need
more space than xfs/btrfs (as if there was no block sharing), but maybe
that's due to how I measure this using df ...

I also tried a "completely default" ZFS configuration, with all
options left at the default (recordsize=128kB, compression=lz4, and
logbias=latency). That performs about the same, except that the disk
usage is lower thanks to the compression.

note: Because I'm a hip cool kid, I also ran the tests on bcachefs. The
results are included in the CSV/PDF attachments. In general it's much
slower than xfs/btrfs/ext4, and the disk space is somewhere in between
btrfs and xfs (for the CoW cases). We'll see how this improves as it
matures in the future.

The attachments are tables with the total duration / disk space usage,
and impact of prefetching. The tables are similar to what I shared
before, except that the color scale is applied to the values directly.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Вложения

Re: pg_combinebackup --copy-file-range

От
Tomas Vondra
Дата:
On 4/7/24 19:46, Tomas Vondra wrote:
> On 4/5/24 21:43, Tomas Vondra wrote:
>> Hi,
>>
>> ...
>>
>> 2) The prefetching is not a huge improvement, at least not for these
>> three filesystems (btrfs, ext4, xfs). From the color scale it might seem
>> like it helps, but those values are relative to the baseline, so when
>> the non-prefetching value is 5% and with prefetching 10%, that means the
>> prefetching makes it slower. And that's very often true.
>>
>> This is visible more clearly in prefetching.pdf, comparing the
>> non-prefetching and prefetching results for each patch, not to baseline.
>> That makes it quite clear there's a lot of "red" where prefetching
>> makes it slower. It certainly does help for larger increments (which
>> makes sense, because the modified blocks are distributed randomly, and
>> thus come from random files, making long streaks unlikely).
>>
>> I've imagined the prefetching could be made a bit smarter to ignore the
>> streaks (=sequential patterns), but once again - this only matters with
>> the batching, which we don't have. And without the batching it looks
>> like a net loss (that's the first column in the prefetching PDF).
>>
>> I did start thinking about prefetching because of ZFS, where it was
>> necessary to get decent performance. And that's still true. But (a) even
>> with the explicit prefetching it's still 2-3x slower than any of these
>> filesystems, so I assume performance-sensitive use cases won't use it.
>> And (b) the prefetching seems necessary in all cases, no matter how
>> large the increment is. Which goes directly against the idea of looking
>> at how random the blocks are and prefetching only the sufficiently
>> random patterns. That doesn't seem like a great thing.
>>
> 
> I finally got more complete ZFS results, and I also decided to get
> some numbers without the ZFS tuning I did. And boy oh boy ...
> 
> All the tests I did with ZFS were tuned the way I've seen recommended
> when using ZFS for PostgreSQL, that is
> 
>   zfs set recordsize=8K logbias=throughput compression=none
> 
> and this performed quite poorly - pg_combinebackup took 4-8x longer than
> with the traditional filesystems (btrfs, xfs, ext4) and the only thing
> that improved that measurably was prefetching.
> 
> But once I reverted back to the default recordsize of 128kB the
> performance is waaaaaay better - entirely comparable to ext4/xfs, while
> btrfs remains faster with --copy-file-range --no-manifest (by a factor
> of 2-3x).
> 

I forgot to explicitly say that I think this confirms the decision not to
push the patch adding explicit prefetching to pg_combinebackup. It's
not needed/beneficial even for ZFS, when using a suitable configuration.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company