Discussion: pg_upgrade --copy-file-range
Hello,

I was just in a pg_upgrade unconference session at PGCon where the lack of $SUBJECT came up. This system call gives the kernel the option to use fast block cloning on XFS, ZFS (as of very recently), etc., and works on Linux and FreeBSD. It's probably much the same as --clone mode on COW file systems, except that that is Linux-only. On overwrite file systems (ie not copy-on-write, like ext4), it may also be able to push copies down to storage hardware/network file systems.

There was something like this in the nearby large files patch set, but in that version it just magically did it when available in --copy mode. Now I think the user should have to opt in with --copy-file-range, and we should simply error out if it fails. It may not work in some cases -- for example, the man page says that older Linux systems can fail with EXDEV when you try to copy across file systems, while newer systems will do something less efficient but still sensible internally; also I saw a claim that some older versions had weird bugs. Better to just expose the raw functionality and let users say when they want it and read the error if it fails, I think.
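For context, a minimal standalone sketch (not PostgreSQL source; the function name here is made up) of the behavior described above - loop over copy_file_range() and simply error out on any failure, including EXDEV, instead of silently falling back to a plain read/write copy:

```c
#define _GNU_SOURCE				/* for copy_file_range() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Sketch only: copy a whole file with copy_file_range() and die on any
 * failure (e.g. EXDEV when crossing file systems on older Linux kernels)
 * rather than falling back to a block-by-block copy.
 */
static void
copy_file_range_or_die(const char *src, const char *dst)
{
	int			srcfd = open(src, O_RDONLY);
	int			dstfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0666);
	struct stat st;
	off_t		remaining;

	if (srcfd < 0 || dstfd < 0 || fstat(srcfd, &st) < 0)
	{
		perror("open/fstat");
		exit(1);
	}
	remaining = st.st_size;
	while (remaining > 0)
	{
		/* the "flags" argument is reserved for future use and must be 0 */
		ssize_t		n = copy_file_range(srcfd, NULL, dstfd, NULL,
										(size_t) remaining, 0);

		if (n <= 0)
		{
			perror("copy_file_range");
			exit(1);
		}
		remaining -= n;
	}
	close(srcfd);
	close(dstfd);
}
```

Whether the kernel clones blocks, offloads the copy, or falls back to an internal copy is invisible at this level, which is exactly the planning problem discussed later in the thread.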
Attachments
On 02.06.23 21:30, Thomas Munro wrote:
> I was just in a pg_upgrade unconference session at PGCon where the lack of $SUBJECT came up. This system call gives the kernel the option to use fast block cloning on XFS, ZFS (as of very recently), etc., and works on Linux and FreeBSD. It's probably much the same as --clone mode on COW file systems, except that that is Linux-only. On overwrite file systems (ie not copy-on-write, like ext4), it may also be able to push copies down to storage hardware/network file systems.
>
> There was something like this in the nearby large files patch set, but in that version it just magically did it when available in --copy mode. Now I think the user should have to opt in with --copy-file-range, and we should simply error out if it fails. It may not work in some cases -- for example, the man page says that older Linux systems can fail with EXDEV when you try to copy across file systems, while newer systems will do something less efficient but still sensible internally; also I saw a claim that some older versions had weird bugs. Better to just expose the raw functionality and let users say when they want it and read the error if it fails, I think.

When we added --clone, copy_file_range() was available, but the problem was that it was hard for the user to predict whether you'd get the fast clone behavior or the slow copy behavior. That's the kind of thing you want to know when planning and testing your upgrade. At the time, there were patches passed around in Linux kernel circles that would have been able to enforce cloning via the flags argument of copy_file_range(), but that didn't make it to the mainline.

So, yes, being able to specify exactly which copy mechanism to use makes sense, so that users can choose the tradeoffs.

About your patch:

I think you should have a "check" function called from check_new_cluster(). That check function can then also handle the "not supported" case, and you don't need to handle that in parseCommandLine(). I suggest following the clone example for these, since the issues there are very similar.
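The clone example Peter refers to works by actually attempting the operation on a small scratch file. A hypothetical check for copy_file_range() in the same spirit might look like the sketch below (the function name, file names, and structure are invented for illustration; this is not the committed PostgreSQL code):

```c
#define _GNU_SOURCE				/* for copy_file_range() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical sketch: probe whether copy_file_range() actually works
 * between two directories by copying a tiny scratch file, the same way
 * pg_upgrade's clone check probes ioctl(FICLONE).
 */
static bool
check_copy_file_range_works(const char *old_dir, const char *new_dir)
{
	char		src[1024];
	char		dst[1024];
	int			srcfd;
	int			dstfd;
	off_t		off_in = 0;
	bool		ok = false;

	snprintf(src, sizeof(src), "%s/PG_VERSION.cfrtest.src", old_dir);
	snprintf(dst, sizeof(dst), "%s/PG_VERSION.cfrtest.dst", new_dir);

	srcfd = open(src, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (srcfd < 0)
		return false;
	if (write(srcfd, "test", 4) != 4)
	{
		close(srcfd);
		unlink(src);
		return false;
	}

	dstfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (dstfd >= 0)
	{
		/* a single successful (possibly short) copy proves support */
		ok = copy_file_range(srcfd, &off_in, dstfd, NULL, 4, 0) > 0;
		close(dstfd);
		unlink(dst);
	}
	close(srcfd);
	unlink(src);
	return ok;
}
```

Running the probe at check time means an unsupported configuration fails up front with a clear message instead of partway through the transfer.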
On Mon, Jul 3, 2023 at 7:47 PM Peter Eisentraut <peter@eisentraut.org> wrote:
> When we added --clone, copy_file_range() was available, but the problem was that it was hard for the user to predict whether you'd get the fast clone behavior or the slow copy behavior. That's the kind of thing you want to know when planning and testing your upgrade. At the time, there were patches passed around in Linux kernel circles that would have been able to enforce cloning via the flags argument of copy_file_range(), but that didn't make it to the mainline.
>
> So, yes, being able to specify exactly which copy mechanism to use makes sense, so that users can choose the tradeoffs.

Thanks for looking. Yeah, it is quite inconvenient for planning purposes that it is hard for a user to know which internal strategy it uses, but that's the interface we have (and clearly "flags" is reserved for future use, so that might still evolve...).

> About your patch:
>
> I think you should have a "check" function called from check_new_cluster(). That check function can then also handle the "not supported" case, and you don't need to handle that in parseCommandLine(). I suggest following the clone example for these, since the issues there are very similar.

Done.
Attachments
On 08.10.23 07:15, Thomas Munro wrote:
>> About your patch:
>>
>> I think you should have a "check" function called from check_new_cluster(). That check function can then also handle the "not supported" case, and you don't need to handle that in parseCommandLine(). I suggest following the clone example for these, since the issues there are very similar.
>
> Done.

This version looks good to me.

Tiny nit: You copy-and-pasted "%s/PG_VERSION.clonetest"; maybe choose a different suffix.
On 13.11.23 08:15, Peter Eisentraut wrote:
> On 08.10.23 07:15, Thomas Munro wrote:
>>> About your patch:
>>>
>>> I think you should have a "check" function called from check_new_cluster(). That check function can then also handle the "not supported" case, and you don't need to handle that in parseCommandLine(). I suggest following the clone example for these, since the issues there are very similar.
>>
>> Done.
>
> This version looks good to me.
>
> Tiny nit: You copy-and-pasted "%s/PG_VERSION.clonetest"; maybe choose a different suffix.

Thomas, are you planning to proceed with this patch?
On Sat, Dec 23, 2023 at 9:40 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 13.11.23 08:15, Peter Eisentraut wrote:
>> On 08.10.23 07:15, Thomas Munro wrote:
>>>> About your patch:
>>>>
>>>> I think you should have a "check" function called from check_new_cluster(). That check function can then also handle the "not supported" case, and you don't need to handle that in parseCommandLine(). I suggest following the clone example for these, since the issues there are very similar.
>>>
>>> Done.
>>
>> This version looks good to me.
>>
>> Tiny nit: You copy-and-pasted "%s/PG_VERSION.clonetest"; maybe choose a different suffix.
>
> Thomas, are you planning to proceed with this patch?

Yes. Sorry for being slow... got stuck working on an imminent new version of streaming read. I will be defrosting my commit bit and committing this one and a few things shortly.

As it happens I was just thinking about this particular patch because I suddenly had a strong urge to teach pg_combinebackup to use copy_file_range. I wonder if you had the same idea...
On Sat, Dec 23, 2023 at 09:52:59AM +1300, Thomas Munro wrote:
> As it happens I was just thinking about this particular patch because I suddenly had a strong urge to teach pg_combinebackup to use copy_file_range. I wonder if you had the same idea...

Yeah, +1. That would make copy_file_blocks() more efficient where the code is copying 50 blocks in batches because it needs to reassign checksums to the blocks copied.
--
Michael
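As a rough illustration of the batched copy being discussed - read a batch of blocks, feed it to the checksum, write it out - here is a simplified sketch with a toy additive checksum standing in for the real manifest checksum machinery (the function name and constants are assumptions, not the pg_combinebackup source):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BLCKSZ		8192		/* PostgreSQL's default block size */
#define BATCH_BLOCKS 50			/* copy this many blocks per read/write */

/* toy stand-in for the real checksum context update */
static void
checksum_update(uint64_t *ctx, const unsigned char *data, size_t len)
{
	for (size_t i = 0; i < len; i++)
		*ctx += data[i];
}

/*
 * Sketch of a batched block copy: read up to 50 blocks at a time,
 * update the running checksum with the batch, write the batch out.
 * Returns the toy checksum of the copied data.
 */
static uint64_t
copy_file_blocks_sketch(const char *src, const char *dst)
{
	unsigned char *buf = malloc((size_t) BLCKSZ * BATCH_BLOCKS);
	FILE	   *in = fopen(src, "rb");
	FILE	   *out = fopen(dst, "wb");
	uint64_t	checksum = 0;
	size_t		n;

	if (buf == NULL || in == NULL || out == NULL)
	{
		perror("copy_file_blocks_sketch");
		exit(1);
	}
	while ((n = fread(buf, 1, (size_t) BLCKSZ * BATCH_BLOCKS, in)) > 0)
	{
		checksum_update(&checksum, buf, n);
		if (fwrite(buf, 1, n, out) != n)
		{
			perror("fwrite");
			exit(1);
		}
	}
	fclose(in);
	fclose(out);
	free(buf);
	return checksum;
}
```

The point Michael makes follows directly: because the checksum is computed from the bytes as they pass through the buffer, this path cannot be replaced wholesale by a kernel-side copy when a checksum is required.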
Attachments
Hi Thomas, Michael, Peter and -hackers,

On Sun, Dec 24, 2023 at 3:57 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Sat, Dec 23, 2023 at 09:52:59AM +1300, Thomas Munro wrote:
>> As it happens I was just thinking about this particular patch because I suddenly had a strong urge to teach pg_combinebackup to use copy_file_range. I wonder if you had the same idea...
>
> Yeah, +1. That would make copy_file_blocks() more efficient where the code is copying 50 blocks in batches because it needs to reassign checksums to the blocks copied.

I've tried to achieve what you were discussing. Actually this was my first thought when using pg_combinebackup with larger (realistic) backup sizes back in December. Attached is a set of very DIRTY (!) patches that provide CoW options (--clone/--copy-file-range) to pg_combinebackup (just like pg_upgrade, to keep it in sync), while also refactoring some related bits of code to avoid duplication.

With XFS (with reflink=1, which is the default) on Linux with kernel 5.10 and ~210GB backups, I'm getting:

    root@jw-test-1:/xfs# du -sm *
    210229  full
    250     incr.1

Today in master, the old classic read()/while() loop without CoW/reflink optimization:

    root@jw-test-1:/xfs# rm -rf outtest; sync; sync; sync; echo 3 | sudo tee /proc/sys/vm/drop_caches; time /usr/pgsql17/bin/pg_combinebackup --manifest-checksums=NONE -o outtest full incr.1
    3

    real    49m43.963s
    user    0m0.887s
    sys     2m52.697s

VS patch with "--clone":

    root@jw-test-1:/xfs# rm -rf outtest; sync; sync; sync; echo 3 | sudo tee /proc/sys/vm/drop_caches; time /usr/pgsql17/bin/pg_combinebackup --manifest-checksums=NONE --clone -o outtest full incr.1
    3

    real    0m39.812s
    user    0m0.325s
    sys     0m2.401s

So it is 49 mins down to 40 seconds(!), +/-10s (3 tries), if the FS supports CoW/reflinks (XFS, BTRFS, upcoming bcachefs?). It looks to me that this might mean that if one actually wants to use incremental backups (to get minimal RTO), it would be wise to only use CoW filesystems from the start so that RTO is as low as possible.

Random patch notes:
- main meat is in v3-0002*, I hope I did not screw up something seriously
- in the worst case: it is opt-in through a switch, so the user can always stick to the classic copy
- no docs so far
- pg_copyfile_offload_supported() should actually be fixed if it is a good path forward
- pgindent actually indents larger areas of code than I would like; any ideas, or is it ok?
- not tested on Win32/macOS/FreeBSD
- I've tested pg_upgrade manually and it seems to work and issue correct syscalls, however some tests are failing(?). I haven't investigated why yet due to lack of time. Any help is appreciated.

-J.
Attachments
On 05.01.24 13:40, Jakub Wartak wrote:
> Random patch notes:
> - main meat is in v3-0002*, I hope I did not screw up something seriously
> - in the worst case: it is opt-in through a switch, so the user can always stick to the classic copy
> - no docs so far
> - pg_copyfile_offload_supported() should actually be fixed if it is a good path forward
> - pgindent actually indents larger areas of code than I would like; any ideas, or is it ok?
> - not tested on Win32/macOS/FreeBSD
> - I've tested pg_upgrade manually and it seems to work and issue correct syscalls, however some tests are failing(?). I haven't investigated why yet due to lack of time.

Something is wrong with the pgindent in your patch set. Maybe you used a wrong version. You should try to fix that, because it is hard to process your patch with that amount of unrelated reformatting.

As far as I can tell, the original pg_upgrade patch has been ready to commit since October. Unless Thomas has any qualms that have not been made explicit in this thread, I suggest we move ahead with that.

And then Jakub could rebase his patch set on top of that. It looks like if the formatting issues are fixed, the remaining pg_combinebackup support isn't that big.
On Wed, Mar 6, 2024 at 2:43 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> As far as I can tell, the original pg_upgrade patch has been ready to commit since October. Unless Thomas has any qualms that have not been made explicit in this thread, I suggest we move ahead with that.

pg_upgrade --copy-file-range pushed. The only change I made was to remove the EINTR retry condition, which was subtly wrong and actually not needed here AFAICS. (Erm, maybe I did have an unexpressed qualm about some bug reports unfolding around that time about corruption linked to copy_file_range that might have spooked me, but those seem to have been addressed.)

> And then Jakub could rebase his patch set on top of that. It looks like if the formatting issues are fixed, the remaining pg_combinebackup support isn't that big.

+1

I'll also go and rebase CREATE DATABASE ... STRATEGY=file_clone [1].

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
Hi,

I took a quick look at the remaining part adding copy_file_range to pg_combinebackup. The patch no longer applies, so I had to rebase it. Most of the issues were trivial, but I had to fix a couple of missing prototypes - I added them to copy_file.h/c, mostly.

0001 is the minimal rebase + those fixes

0002 has a couple of review comments in copy_file, and it also undoes a lot of unnecessary formatting changes (already pointed out by Peter a couple of days ago).

A couple of review comments:

1) AFAIK opt_errinfo() returns a pointer to the local "buf" variable.

2) I wonder if we even need opt_errinfo(). I'm not sure it actually makes anything simpler.

3) I think it'd be nice to make CopyFileMethod more consistent with transferMode in pg_upgrade.h (I mean, it seems wise to make the naming more consistent; it's probably not worth unifying this somehow).

4) I wonder how we came up with copying the files by 50 blocks, but I now realize it's been like this before this patch. I only noticed because the patch adds a comment before the buffer_size calculation.

5) I dislike the renaming of copy_file_blocks to pg_copyfile. The new name is way more generic / less descriptive - it's clear the former copies the file block by block (well, in chunks); pg_copyfile is pretty vague.

6) This leaves behind copy_file_copyfile, which is now unused.

7) The patch reworks how combinebackup deals with alternative copy implementations - instead of setting strategy_implementation and calling that, the decisions now happen in pg_copyfile_offload with a lot of conditions / ifdef / defined ... I find it pretty hard to understand and reason about. I liked the strategy_implementation approach, as it forces us to keep each method in a separate function. Perhaps there's a reason why that doesn't work for copy_file_range? But in that case this needs much clearer comments.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
Hi Tomas,

> I took a quick look at the remaining part adding copy_file_range to pg_combinebackup. The patch no longer applies, so I had to rebase it. Most of the issues were trivial, but I had to fix a couple of missing prototypes - I added them to copy_file.h/c, mostly.
>
> 0001 is the minimal rebase + those fixes
>
> 0002 has a couple of review comments in copy_file, and it also undoes a lot of unnecessary formatting changes (already pointed out by Peter a couple of days ago).

Thank you very much for this! As discussed privately, I'm not in a position right now to pursue this further at this late stage (at least for v17, which would require an aggressive schedule). My plan was more for v18, after Peter's email, due to other obligations. But if you have cycles and want to continue, please do so without hesitation - I'll try to chime in along the way to test and review for sure.

> A couple of review comments:
>
> 1) AFAIK opt_errinfo() returns a pointer to the local "buf" variable.
>
> 2) I wonder if we even need opt_errinfo(). I'm not sure it actually makes anything simpler.

Yes, as it stands it's broken (somehow I missed the gcc warning); it should be pg_malloc(). I hardly remember, but I wanted to avoid code duplication. No strong opinion; maybe that's a different style, I'll adapt as necessary.

> 3) I think it'd be nice to make CopyFileMethod more consistent with transferMode in pg_upgrade.h (I mean, it seems wise to make the naming more consistent; it's probably not worth unifying this somehow).
>
> 4) I wonder how we came up with copying the files by 50 blocks, but I now realize it's been like this before this patch. I only noticed because the patch adds a comment before the buffer_size calculation.

It looks like it was like that before pg_upgrade was even moved into core. 400kB is indeed a bit of a strange value, so we can leave it as it is, or make COPY_BUF_SIZ 128kB - see [1] (I've double-checked that cp(1) still uses 128kB today) - or maybe just stick to something like 256 or 512kB.

> 5) I dislike the renaming of copy_file_blocks to pg_copyfile. The new name is way more generic / less descriptive - it's clear the former copies the file block by block (well, in chunks); pg_copyfile is pretty vague.
>
> 6) This leaves behind copy_file_copyfile, which is now unused.
>
> 7) The patch reworks how combinebackup deals with alternative copy implementations - instead of setting strategy_implementation and calling that, the decisions now happen in pg_copyfile_offload with a lot of conditions / ifdef / defined ... I find it pretty hard to understand and reason about. I liked the strategy_implementation approach, as it forces us to keep each method in a separate function.

Well, some context (maybe it was my mistake to continue in this thread rather than starting a new one): my plan was 3-in-1: in the original proposal (from Jan) to provide CoW as a generic facility for others to use - in src/common/file_utils.c as per v3-0002-Confine-various-OS-copy-on-write-and-other-copy-a.patch - to unify & confine CoW methods and their quirkiness between pg_combinebackup and pg_upgrade and other potential CoW uses too. That was before Thomas M. pushed CoW just for pg_upgrade as d93627bcbe5001750e7611f0e637200e2d81dcff. I had this idea back then to have pg_copyfile() [normal block copy] and pg_copyfile_offload_supported(), pg_copyfile_offload(PG_COPYFILE_IOCTL_FICLONE, PG_COPYFILE_COPY_FILE_RANGE, PG_COPYFILE_who_has_idea_what_they_come_up_with_in_future). In your version of the patch it's local to pg_combinebackup, so it might make no sense after all.

If you look at pg_upgrade and pg_combinebackup, they both have code duplication with lots of those ifs/IFs (assuming the user wants to have it as a drop-in [--clone/--copy/--copyfile] and the platform may or may not have it). I even considered --cow=ficlone|copy_file_range to sync both tools from the CLI arguments point of view, but that would break backwards compatibility, so I did not do that.

Also, there's a problem with pg_combinebackup's strategy_implementation in that it actually cannot on its own decide (I think) which CoW method to use or not. There were some longer discussions that settled on one thing (for pg_upgrade): it's the user who is in control of HOW the copy gets done (due to potential issues in OS CoW() implementations, e.g. if NFS would be involved on one side). See pg_upgrade's --clone/--copy/--copy-file-range/--sync-method options. I wanted to stick to that, so pg_combinebackup also needs to give the same options to the user.

That was for the historical context. Now, you wrote "it's probably not worth unifying this somehow" a few sentences earlier, so my take is the following: we can just concentrate on getting copy_file_range() and ioctl_ficlone into pg_combinebackup at the price of duplicating some complexity for now (in short, to start with a clean slate; it doesn't necessarily need to be my patch as the base if we think it's worthwhile for v17 - or stick to your reworked patch of mine). Later (v18?) some bigger refactor could unify and move the copy methods to some more central place (so then we would be in sync, as there would be no doubling like you mentioned, e.g. pg_upgrade's enum transferMode <-> the patch's enum CopyFileMethod).

So for now I'm +1 on renaming all the things as you want -- indeed pg_copy* might not be a good fit in a localized version.

-J.

[1] - https://eklitzke.org/efficient-file-copying-on-linux
Here's a patch reworked along the lines from a couple of days ago.

The primary goals were to add clone/copy_file_range while minimizing unnecessary disruption, and overall cleanup of the patch. I'm not saying it's committable, but I think the patch is much easier to understand.

The main change is that this abandons the idea of handling all possible cases in a single function that looks like a maze of ifdefs, and instead separates each case into its own function, and the decision happens much earlier. This is pretty much exactly what pg_upgrade does, BTW.

There's maybe an argument that these functions could be unified and moved to a library in src/common - I can imagine doing that, but I don't think it's required. The functions are pretty trivial wrappers, and it's not like we expect many more callers. And there's probably stuff we'd need to keep out of that library (e.g. the decision which copy/clone methods are available / should be used, or error reporting). So it doesn't seem worth it, at least for now.

There's one question, though. As it stands, there's a bit of asymmetry between handling CopyFile() on WIN32 and the clone/copy_file_range on other platforms. On WIN32, we simply switch to CopyFile automatically if we don't need to calculate a checksum. But with the other methods, we error out if the user requests those and we need to calculate the checksum.

The asymmetry comes from the fact there's no option to request CopyFile on WIN32, and we feel confident it's always the right thing to do (safe, faster). We don't seem to know that for the other methods, so the user has to explicitly request those. And if the user requests --clone, for example, it'd be wrong to silently fall back to plain copy. Still, I wonder if this might cause some undesirable issues during restores. But I guess that's why we have --dry-run.

This asymmetry also shows a bit in the code - the CopyFile is coded and called a bit differently from the other methods. FWIW I abandoned the approach with "strategy" and just use a switch on a CopyMode enum, just like pg_upgrade does.

There's a couple more smaller changes:

- Addition of docs for --clone/--copy-file-range (shameless copy from pg_upgrade docs).

- Removal of opt_errinfo - not only was it buggy, I think the code is actually cleaner without it.

- Removal of the EINTR retry condition from copy_file_range handling (this is what Thomas ended up with for pg_upgrade while committing that part).

Put together, this cuts the patch from ~40kB to ~15kB (most of this is due to the cleanup of unnecessary whitespace changes, though).

I think to make this committable, this requires some review and testing, ideally on a range of platforms.

One open question is how to allow testing this. For pg_upgrade we now have PG_TEST_PG_UPGRADE_MODE, which can be set to e.g. "--clone". I wonder if we should add PG_TEST_PG_COMBINEBACKUP_MODE ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
On Fri, Mar 22, 2024 at 10:40 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> There's one question, though. As it stands, there's a bit of asymmetry between handling CopyFile() on WIN32 and the clone/copy_file_range on other platforms. On WIN32, we simply switch to CopyFile automatically if we don't need to calculate a checksum. But with the other methods, we error out if the user requests those and we need to calculate the checksum.

That seems completely broken. copy_file() needs to have the ability to calculate a checksum if one is required; when one isn't required, it can do whatever it likes. So we should always fall back to the block-by-block method if we need a checksum. Whatever option the user specified should only be applied when we don't need a checksum.

Consider, for example:

pg_basebackup -D sunday -c fast --manifest-checksums=CRC32C
pg_basebackup -D monday -c fast --manifest-checksums=SHA224 --incremental sunday/backup_manifest
pg_combinebackup sunday monday -o tuesday --manifest-checksums=CRC32C --clone

Any files that are copied in their entirety from Sunday's backup can be cloned, if we have support for cloning. But any files copied from Monday's backup will need to be re-checksummed, since the checksum algorithms don't match. With what you're describing, it sounds like pg_combinebackup would just fail in this case; I don't think that's the behavior that the user is going to want.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 3/22/24 17:42, Robert Haas wrote:
> That seems completely broken. copy_file() needs to have the ability to calculate a checksum if one is required; when one isn't required, it can do whatever it likes. So we should always fall back to the block-by-block method if we need a checksum. Whatever option the user specified should only be applied when we don't need a checksum.
>
> Any files that are copied in their entirety from Sunday's backup can be cloned, if we have support for cloning. But any files copied from Monday's backup will need to be re-checksummed, since the checksum algorithms don't match. With what you're describing, it sounds like pg_combinebackup would just fail in this case; I don't think that's the behavior that the user is going to want.

Right, this will happen:

pg_combinebackup: error: unable to use accelerated copy when manifest checksums are to be calculated. Use --no-manifest

Are you saying we should just silently override the copy method and do the copy block by block? I'm not strongly opposed to that, but it feels wrong to just ignore that the user explicitly requested cloning, and I'm not sure why this should be different from any other case where the user requests an incompatible combination of options and/or options that are not supported in the current configuration. Why not just tell the user to use the correct parameters, i.e. either remove --clone or add --no-manifest?

FWIW I now realize it actually fails a bit earlier than I thought - when parsing the options, not in copy_file. But then some checks (whether a given copy method is supported) happen in the copy functions. I wonder if it'd be better/possible to do all of that in one place, not sure.

Also, the message only suggests using --no-manifest. It probably should suggest removing --clone too.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 22, 2024 at 1:22 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> Right, this will happen:
>
> pg_combinebackup: error: unable to use accelerated copy when manifest checksums are to be calculated. Use --no-manifest
>
> Are you saying we should just silently override the copy method and do the copy block by block?

Yes.

> I'm not strongly opposed to that, but it feels wrong to just ignore that the user explicitly requested cloning, and I'm not sure why this should be different from any other case where the user requests an incompatible combination of options and/or options that are not supported in the current configuration.

I don't feel like copying block-by-block when that's needed to compute a checksum is ignoring what the user requested. I mean, if we'd had to perform reconstruction rather than copying an entire file, we would have done that regardless of whether --clone had been there, and I don't see why the need-to-compute-a-checksum case is any different. I think we should view a flag like --clone as specifying how to copy a file when we don't need to do anything but copy it. I don't think it should dictate that we're not allowed to perform other processing when that other processing is required.

From my point of view, this is not a case of incompatible options having been specified. If you run pg_basebackup with both --format=p and --format=t, those are incompatible options; the backup can be done one way or the other, but not both at once. But here there's no such conflict. Block-by-block copying and fast-copying can happen as part of the same operation, as in the example that I showed, where some files need the block-by-block copying and some can be fast-copied. The user is entitled to specify which fast-copying method they would like to have used for the files where fast-copying is possible, without getting a failure just because it isn't possible for every single file.

Or to say it the other way around, if there's 1 file that needs to be copied block by block and 99 files that can be fast-copied, you want to force the user to use the block-by-block method for all 100 files. I want to force the use of the block-by-block method for the 1 file where that's the only valid method, and let them choose what they want to do for the other 99.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hmm, this discussion seems to assume that we only use copy_file_range() to copy/clone whole segment files, right? That's great and may even get most of the available benefit given typical databases with many segments of old data that never changes, but... I think copy_file_range() allows us to go further than the other whole-file clone techniques: we can stitch together parts of an old backup segment file and an incremental backup to create a new file. If you're interested in minimising disk use while also removing dependencies on the preceding chain of backups, then it might make sense to do that even if you *also* have to read the data to compute the checksums, I think? That's why I mentioned it: if copy_file_range() (ie sub-file-level block sharing) is a solution in search of a problem, has the world ever seen a better problem than pg_combinebackup?
On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Hmm, this discussion seems to assume that we only use copy_file_range() to copy/clone whole segment files, right? That's great and may even get most of the available benefit given typical databases with many segments of old data that never changes, but... I think copy_file_range() allows us to go further than the other whole-file clone techniques: we can stitch together parts of an old backup segment file and an incremental backup to create a new file. If you're interested in minimising disk use while also removing dependencies on the preceding chain of backups, then it might make sense to do that even if you *also* have to read the data to compute the checksums, I think? That's why I mentioned it: if copy_file_range() (ie sub-file-level block sharing) is a solution in search of a problem, has the world ever seen a better problem than pg_combinebackup?

That makes sense; it's just a different part of the code than I thought we were talking about.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 3/22/24 19:40, Robert Haas wrote: > On Fri, Mar 22, 2024 at 1:22 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> Right, this will happen: >> >> pg_combinebackup: error: unable to use accelerated copy when manifest >> checksums are to be calculated. Use --no-manifest >> >> Are you saying we should just silently override the copy method and do >> the copy block by block? > > Yes. > >> I'm not strongly opposed to that, but it feels >> wrong to just ignore that the user explicitly requested cloning, and I'm >> not sure why should this be different from any other case when the user >> requests incompatible combination of options and/or options that are not >> supported on the current configuration. > > I don't feel like copying block-by-block when that's needed to compute > a checksum is ignoring what the user requested. I mean, if we'd had to > perform reconstruction rather than copying an entire file, we would > have done that regardless of whether --clone had been there, and I > don't see why the need-to-compute-a-checksum case is any different. I > think we should view a flag like --clone as specifying how to copy a > file when we don't need to do anything but copy it. I don't think it > should dictate that we're not allowed to perform other processing when > that other processing is required. > > From my point of view, this is not a case of incompatible options > having been specified. If you specify run pg_basebackup with both > --format=p and --format=t, those are incompatible options; the backup > can be done one way or the other, but not both at once. But here > there's no such conflict. Block-by-block copying and fast-copying can > happen as part of the same operation, as in the example that I showed, > where some files need the block-by-block copying and some can be > fast-copied. 
The user is entitled to specify which fast-copying method > they would like to have used for the files where fast-copying is > possible without getting a failure just because it isn't possible for > every single file. > > Or to say it the other way around, if there's 1 file that needs to be > copied block by block and 99 files that can be fast-copied, you want > to force the user to the block-by-block method for all 100 files. I > want to force the use of the block-by-block method for the 1 file > where that's the only valid method, and let them choose what they want > to do for the other 99. > OK, that makes sense. Here's a patch that should work like this - in copy_file we check if we need to calculate checksums, and either use the requested copy method, or fall back to the block-by-block copy. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
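The fallback described above — use the requested fast-copy method only when no checksum has to be computed — can be sketched as a tiny dispatch helper. This is a minimal illustration with hypothetical names; the actual pg_combinebackup code structures this differently inside copy_file:

```c
#include <stdbool.h>

/* Hypothetical names -- the real pg_combinebackup enum differs in detail. */
typedef enum
{
	COPY_MODE_COPY,				/* plain block-by-block read/write */
	COPY_MODE_CLONE,			/* ioctl(FICLONE) */
	COPY_MODE_COPY_FILE_RANGE	/* copy_file_range() */
} CopyMethod;

/*
 * Decide how a file will actually be copied: any fast-copy method is
 * silently downgraded to a block-by-block copy when we must read the
 * data anyway to compute a manifest checksum.
 */
static CopyMethod
effective_copy_method(CopyMethod requested, bool need_checksum)
{
	if (need_checksum)
		return COPY_MODE_COPY;
	return requested;
}
```

With this shape, --clone or --copy-file-range still applies to the (typically many) files that can be fast-copied, while the few files needing checksums quietly take the slow path.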
Attachments
On 3/23/24 13:38, Robert Haas wrote: > On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> Hmm, this discussion seems to assume that we only use >> copy_file_range() to copy/clone whole segment files, right? That's >> great and may even get most of the available benefit given typical >> databases with many segments of old data that never changes, but... I >> think copy_write_range() allows us to go further than the other >> whole-file clone techniques: we can stitch together parts of an old >> backup segment file and an incremental backup to create a new file. >> If you're interested in minimising disk use while also removing >> dependencies on the preceding chain of backups, then it might make >> sense to do that even if you *also* have to read the data to compute >> the checksums, I think? That's why I mentioned it: if >> copy_file_range() (ie sub-file-level block sharing) is a solution in >> search of a problem, has the world ever seen a better problem than >> pg_combinebackup? > > That makes sense; it's just a different part of the code than I > thought we were talking about. > Yeah, that's in write_reconstructed_file() and the patch does not touch that at all. I agree it would be nice to use copy_file_range() in this part too, and it doesn't seem it'd be that hard to do, I think. It seems we'd just need a "fork" that either calls pread/pwrite or copy_file_range, depending on checksums and what was requested. BTW is there a reason why the code calls "write" and not "pg_pwrite"? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 3/23/24 14:47, Tomas Vondra wrote: > On 3/23/24 13:38, Robert Haas wrote: >> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas.munro@gmail.com> wrote: >>> Hmm, this discussion seems to assume that we only use >>> copy_file_range() to copy/clone whole segment files, right? That's >>> great and may even get most of the available benefit given typical >>> databases with many segments of old data that never changes, but... I >>> think copy_write_range() allows us to go further than the other >>> whole-file clone techniques: we can stitch together parts of an old >>> backup segment file and an incremental backup to create a new file. >>> If you're interested in minimising disk use while also removing >>> dependencies on the preceding chain of backups, then it might make >>> sense to do that even if you *also* have to read the data to compute >>> the checksums, I think? That's why I mentioned it: if >>> copy_file_range() (ie sub-file-level block sharing) is a solution in >>> search of a problem, has the world ever seen a better problem than >>> pg_combinebackup? >> >> That makes sense; it's just a different part of the code than I >> thought we were talking about. >> > > Yeah, that's in write_reconstructed_file() and the patch does not touch > that at all. I agree it would be nice to use copy_file_range() in this > part too, and it doesn't seem it'd be that hard to do, I think. > > It seems we'd just need a "fork" that either calls pread/pwrite or > copy_file_range, depending on checksums and what was requested. > Here's a patch to use copy_file_range in write_reconstructed_file too, when requested/possible. One thing that I'm not sure about is whether to do pg_fatal() if --copy-file-range but the platform does not support it. This is more like what pg_upgrade does, but maybe we should just ignore what the user requested and fallback to the regular copy (a bit like when having to do a checksum for some files). Or maybe the check should just happen earlier ... 
I've been thinking about what Thomas wrote - that maybe it'd be good to do copy_file_range() even when calculating the checksum, and I think he may be right. But the current patch does not do that, and while it doesn't seem very difficult to do (at least when reconstructing the file from incremental backups) I don't have a very good intuition whether it'd be a win or not in typical cases.

I have a naive question about the checksumming - if we used a merkle-tree-like scheme, i.e. computed the file checksum from per-block hashes rather than from the raw data, wouldn't that allow calculating the checksum even without having to read the unchanged blocks, making copy_file_range more efficient? Sure, it's more complex, but a well-known scheme. (OK, now I realize it'd mean we can't use tools like sha224sum to hash the files and compare to the manifest. I guess that's why we don't do this ...)

regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
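As a toy illustration of that idea (this is not PostgreSQL's manifest format, and a stand-in FNV-1a hash replaces a real digest like SHA-224): if the manifest recorded one hash per block and defined the file checksum over those block hashes, the hash of an unchanged block could be reused from the previous manifest without reading its data:

```c
#include <stdint.h>
#include <stddef.h>

#define FNV_INIT 14695981039346656037ULL

/* Toy 64-bit FNV-1a; stands in for a real cryptographic digest. */
static uint64_t
fnv1a(const unsigned char *buf, size_t len, uint64_t h)
{
	for (size_t i = 0; i < len; i++)
	{
		h ^= buf[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/* Hash one data block (would be read from disk, or skipped if unchanged). */
static uint64_t
block_hash(const unsigned char *block, size_t len)
{
	return fnv1a(block, len, FNV_INIT);
}

/*
 * File-level checksum computed over the per-block hashes rather than the
 * raw bytes.  For blocks the incremental backup did not touch, the caller
 * can supply the hash recorded in the previous manifest instead of
 * re-reading the block -- which is what would let copy_file_range() clone
 * those blocks without reading them.
 */
static uint64_t
file_hash(const uint64_t *block_hashes, size_t nblocks)
{
	return fnv1a((const unsigned char *) block_hashes,
				 nblocks * sizeof(uint64_t), FNV_INIT);
}
```

The downside is exactly the one noted above: the resulting file checksum no longer matches what a plain `sha224sum` over the file contents would produce.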
Attachments
On Sat, Mar 23, 2024 at 9:37 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > OK, that makes sense. Here's a patch that should work like this - in > copy_file we check if we need to calculate checksums, and either use the > requested copy method, or fall back to the block-by-block copy. + Use efficient file cloning (also known as <quote>reflinks</quote> on + some systems) instead of copying files to the new cluster. This can new cluster -> output directory I think your version kind of messes up the debug logging. In my version, every call to copy_file() would emit either "would copy \"%s\" to \"%s\" using strategy %s" and "copying \"%s\" to \"%s\" using strategy %s". In your version, the dry_run mode emits a string similar to the former, but creates separate translatable strings for each copy method instead of using the same one with a different value of %s. In non-dry-run mode, I think your version loses the debug logging altogether. -- Robert Haas EDB: http://www.enterprisedb.com
On Sat, Mar 23, 2024 at 9:47 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > BTW is there a reason why the code calls "write" and not "pg_pwrite"? I think it's mostly because I learned to code a really long time ago. -- Robert Haas EDB: http://www.enterprisedb.com
On Sat, Mar 23, 2024 at 6:57 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > On 3/23/24 14:47, Tomas Vondra wrote: > > On 3/23/24 13:38, Robert Haas wrote: > >> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas.munro@gmail.com> wrote: [..] > > Yeah, that's in write_reconstructed_file() and the patch does not touch > > that at all. I agree it would be nice to use copy_file_range() in this > > part too, and it doesn't seem it'd be that hard to do, I think. > > > > It seems we'd just need a "fork" that either calls pread/pwrite or > > copy_file_range, depending on checksums and what was requested. > > > > Here's a patch to use copy_file_range in write_reconstructed_file too, > when requested/possible. One thing that I'm not sure about is whether to > do pg_fatal() if --copy-file-range but the platform does not support it. [..]

Hi Tomas, so I gave the patches below a go today:
- v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch
- v20240323-2-0002-write_reconstructed_file.patch

My assessment:
v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch - looks more or less good to go
v20240323-2-0002-write_reconstructed_file.patch - needs work, and without it the good effects of clone/copy_file_range() are unlikely

Given Debian 12, a ~100GB DB (pgbench -i -s 7000, plus some additional tables with GiST and GIN indexes, just to see more WAL record types) and backup sizes in MB like this:

106831 full
2823 incr.1 # captured after some time with pgbench -R 100
165 incr.2 # captured after some time with pgbench -R 100

Test cmd: rm -rf outtest; sync; sync ; sync; echo 3 | sudo tee /proc/sys/vm/drop_caches ; time /usr/pgsql17/bin/pg_combinebackup -o outtest full incr.1 incr.2

Test results of the various copies on a small I/O-constrained XFS device:
normal copy: 31m47.407s
--clone copy: error: file cloning not supported on this platform (it's due to the #ifdef of having COPY_FILE_RANGE available)
--copy-file-range: aborted, as it was taking too long; I was expecting it to accelerate, but it did not... obviously this is the transparent failover in case of calculating checksums...
--manifest-checksums=NONE --copy-file-range: BUG, it kept appending to just one file e.g. outtest/base/5/16427.29 with 200GB+ ?? and ended up with ENOSPC [more on this later]
--manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 27m23.887s
--manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and loop-fix: 5m1.986s, but it creates corruption as it stands

Issues:

1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327 complains about win32/mingw:

[15:47:27.184] In file included from copy_file.c:22:
[15:47:27.184] copy_file.c: In function ‘copy_file’:
[15:47:27.184] ../../../src/include/common/logging.h:134:6: error: this statement may fall through [-Werror=implicit-fallthrough=]
[15:47:27.184] 134 | if (unlikely(__pg_log_level <= PG_LOG_DEBUG)) \
[15:47:27.184] | ^
[15:47:27.184] copy_file.c:96:5: note: in expansion of macro ‘pg_log_debug’
[15:47:27.184] 96 | pg_log_debug("would copy \"%s\" to \"%s\" (copy_file_range)",
[15:47:27.184] | ^~~~~~~~~~~~
[15:47:27.184] copy_file.c:99:4: note: here
[15:47:27.184] 99 | case COPY_MODE_COPYFILE:
[15:47:27.184] | ^~~~
[15:47:27.184] cc1: all warnings being treated as errors

2. I do not know what's the consensus between --clone and --copy-file-range, but if we have #ifdef FICLONE clone_works() #elif HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also apply the same logic to the --help so that --clone is not visible there (for consistency?). Also the "error: file cloning not supported on this platform" is technically incorrect, Linux does support ioctl(FICLONE) and copy_file_range(), but we are just not choosing one over another (so technically it is "available"). Nitpicking, I know.

3.
[v20240323-2-0002-write_reconstructed_file.patch]: The mentioned ENOSPACE spiral-of-death-bug symptoms are like that: strace: copy_file_range(8, [697671680], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697679872], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697688064], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697696256], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697704448], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697712640], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697720832], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697729024], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697737216], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697745408], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697753600], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697761792], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697769984], 9, NULL, 8192, 0) = 8192 Notice that dest_off_t (poutoff) is NULL. (gdb) where #0 0x00007f2cd56f6733 in copy_file_range (infd=8, pinoff=pinoff@entry=0x7f2cd53f54e8, outfd=outfd@entry=9, poutoff=poutoff@entry=0x0, length=length@entry=8192, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/copy_file_range.c:28 #1 0x00005555ecd077f4 in write_reconstructed_file (copy_mode=COPY_MODE_COPY_FILE_RANGE, dry_run=false, debug=true, checksum_ctx=0x7ffc4cdb7700, offsetmap=<optimized out>, sourcemap=0x7f2cd54f6010, block_length=<optimized out>, output_filename=0x7ffc4cdba910 "outtest/base/5/16427.29", input_filename=0x7ffc4cdba510 "incr.2/base/5/INCREMENTAL.16427.29") at reconstruct.c:648 #2 reconstruct_from_incremental_file (input_filename=input_filename@entry=0x7ffc4cdba510 "incr.2/base/5/INCREMENTAL.16427.29", output_filename=output_filename@entry=0x7ffc4cdba910 "outtest/base/5/16427.29", relative_path=relative_path@entry=0x7ffc4cdbc670 "base/5", bare_file_name=bare_file_name@entry=0x5555ee2056ef "16427.29", n_prior_backups=n_prior_backups@entry=2, prior_backup_dirs=prior_backup_dirs@entry=0x7ffc4cdbf248, manifests=0x5555ee137a10, 
manifest_path=0x7ffc4cdbad10 "base/5/16427.29", checksum_type=CHECKSUM_TYPE_NONE, checksum_length=0x7ffc4cdb9864, checksum_payload=0x7ffc4cdb9868, debug=true, dry_run=false, copy_method=COPY_MODE_COPY_FILE_RANGE) at reconstruct.c:327

.. it's a spiral of death till ENOSPC. Reverting the v20240323-2-0002-write_reconstructed_file.patch helps. The problem lies in that do { ... } while (wb > 0) infinite loop (?) along with the NULL for the destination off_t. This seems to solve that thingy(?):

-			do
-			{
-				wb = copy_file_range(s->fd, &offsetmap[i], wfd, NULL, BLCKSZ, 0);
+			//do
+			//{
+				wb = copy_file_range(s->fd, &offsetmap[i], wfd, &offsetmap[i], BLCKSZ, 0);
				if (wb < 0)
					pg_fatal("error while copying file range from \"%s\" to \"%s\": %m",
							 input_filename, output_filename);
-			} while (wb > 0);
+			//} while (wb > 0);
 #else

...so that way I've got it down to 5mins.

3. .. but on startup I've got this after trying a psql login: invalid page in block 0 of relation base/5/1259. I've again reverted the v20240323-2-0002 to see if that helped for the next round of pg_combinebackup --manifest-checksums=NONE --copy-file-range, and after 32mins of waiting it did help indeed: I was able to log in and the select counts worked and matched the data properly. I've reapplied the v20240323-2-0002 (with my fix to prevent that endless loop) and the issue was again(!) there. Probably it's related to the destination offset. I couldn't find more time to look at it today and the setup was a big 100GB on a slow device, so just letting you know as fast as possible.

4. More efficiency is on the table (optional; just for completeness; I don't think we have time for that?): even if v20240323-2-0002 would work, the problem is that it would be sending a syscall for every 8kB.
We seem to be performing lots of per-8KB syscalls which hinder performance (both in copy_file_range and in normal copy): pread64(8, ""..., 8192, 369115136) = 8192 // 369115136 + 8192 = 369123328 (matches next pread offset) write(9, ""..., 8192) = 8192 pread64(8, ""..., 8192, 369123328) = 8192 // 369123328 + 8192 = 369131520 write(9, ""..., 8192) = 8192 pread64(8, ""..., 8192, 369131520) = 8192 // and so on write(9, ""..., 8192) = 8192 Apparently there's no merging of adjacent IO/s, so pg_combinebackup wastes lots of time on issuing instead small syscalls but it could let's say do single pread/write (or even copy_file_range()). I think it was not evident in my earlier testing (200GB; 39min vs ~40s) as I had much smaller modifications in my incremental (think of 99% of static data). 5. I think we should change the subject with new patch revision, so that such functionality for incremental backups is not buried down in the pg_upgrade thread ;) -J.
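The merging of adjacent I/Os suggested above could be sketched as a pass that collapses the per-block offset map into maximal runs of consecutive blocks before issuing any pread/pwrite or copy_file_range call. This is only an illustrative structure — the real reconstruct_from_incremental_file would also have to track which backup file each block comes from:

```c
#include <stddef.h>
#include <sys/types.h>

#define BLCKSZ 8192

/*
 * One merged I/O request: nblocks consecutive BLCKSZ-sized blocks
 * starting at "offset" in a single source file.
 */
typedef struct
{
	off_t		offset;
	size_t		nblocks;
} IoRange;

/*
 * Collapse a per-block offset map into maximal runs of adjacent blocks.
 * Returns the number of ranges written to "out", which must have room
 * for nblocks entries in the worst (fully fragmented) case.
 */
static size_t
merge_adjacent_blocks(const off_t *offsetmap, size_t nblocks, IoRange *out)
{
	size_t		nranges = 0;

	for (size_t i = 0; i < nblocks; i++)
	{
		if (nranges > 0 &&
			out[nranges - 1].offset +
			(off_t) (out[nranges - 1].nblocks * BLCKSZ) == offsetmap[i])
			out[nranges - 1].nblocks++; /* extends the current run */
		else
		{
			out[nranges].offset = offsetmap[i];
			out[nranges].nblocks = 1;
			nranges++;
		}
	}
	return nranges;
}
```

With 99% static data, most of a segment would collapse into a handful of large ranges, turning thousands of 8kB syscalls into a few big ones.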
On 3/26/24 15:09, Jakub Wartak wrote: > On Sat, Mar 23, 2024 at 6:57 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > >> On 3/23/24 14:47, Tomas Vondra wrote: >>> On 3/23/24 13:38, Robert Haas wrote: >>>> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas.munro@gmail.com> wrote: > [..] >>> Yeah, that's in write_reconstructed_file() and the patch does not touch >>> that at all. I agree it would be nice to use copy_file_range() in this >>> part too, and it doesn't seem it'd be that hard to do, I think. >>> >>> It seems we'd just need a "fork" that either calls pread/pwrite or >>> copy_file_range, depending on checksums and what was requested. >>> >> >> Here's a patch to use copy_file_range in write_reconstructed_file too, >> when requested/possible. One thing that I'm not sure about is whether to >> do pg_fatal() if --copy-file-range but the platform does not support it. > [..] > > Hi Tomas, so I gave a go to the below patches today: > - v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch > - v20240323-2-0002-write_reconstructed_file.patch > > My assessment: > > v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch - > looks like more or less good to go There's some issues with the --dry-run, pointed out by Robert. Should be fixed in the attached version. 
> v20240323-2-0002-write_reconstructed_file.patch - needs work and > without that clone/copy_file_range() good effects are unlikely > > Given Debian 12, ~100GB DB, (pgbench -i -s 7000 , and some additional > tables with GiST and GIN indexes , just to see more WAL record types) > and with backups sizes in MB like that: > > 106831 full > 2823 incr.1 # captured after some time with pgbench -R 100 > 165 incr.2 # captured after some time with pgbench -R 100 > > Test cmd: rm -rf outtest; sync; sync ; sync; echo 3 | sudo tee > /proc/sys/vm/drop_caches ; time /usr/pgsql17/bin/pg_combinebackup -o > outtest full incr.1 incr.2 > > Test results of various copies on small I/O constrained XFS device: > normal copy: 31m47.407s > --clone copy: error: file cloning not supported on this platform (it's > due #ifdef of having COPY_FILE_RANGE available) > --copy-file-range: aborted, as it was taking too long , I was > expecting it to accelerate, but it did not... obviously this is the > transparent failover in case of calculating checksums... > --manifest-checksums=NONE --copy-file-range: BUG, it keep on appending > to just one file e.g. outtest/base/5/16427.29 with 200GB+ ?? and ended > up with ENOSPC [more on this later] That's really strange. > --manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 27m23.887s > --manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and > loop-fix: 5m1.986s but it creates corruption as it stands > Thanks. I plan to do more similar tests, once my machines get done with some other stuff. > Issues: > > 1. 
https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327 > compains about win32/mingw: > > [15:47:27.184] In file included from copy_file.c:22: > [15:47:27.184] copy_file.c: In function ‘copy_file’: > [15:47:27.184] ../../../src/include/common/logging.h:134:6: error: > this statement may fall through [-Werror=implicit-fallthrough=] > [15:47:27.184] 134 | if (unlikely(__pg_log_level <= PG_LOG_DEBUG)) \ > [15:47:27.184] | ^ > [15:47:27.184] copy_file.c:96:5: note: in expansion of macro ‘pg_log_debug’ > [15:47:27.184] 96 | pg_log_debug("would copy \"%s\" to \"%s\" > (copy_file_range)", > [15:47:27.184] | ^~~~~~~~~~~~ > [15:47:27.184] copy_file.c:99:4: note: here > [15:47:27.184] 99 | case COPY_MODE_COPYFILE: > [15:47:27.184] | ^~~~ > [15:47:27.184] cc1: all warnings being treated as errors > Yup, missing break. > 2. I do not know what's the consensus between --clone and > --copy-file-range , but if we have #ifdef FICLONE clone_works() #elif > HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also > apply the same logic to the --help so that --clone is not visible > there (for consistency?). Also the "error: file cloning not supported > on this platform " is technically incorrect, Linux does support > ioctl(FICLONE) and copy_file_range(), but we are just not choosing one > over another (so technically it is "available"). Nitpicking I know. > That's a good question, I'm not sure. But whatever we do, we should do the same thing in pg_upgrade. Maybe there's some sort of precedent? > 3. 
[v20240323-2-0002-write_reconstructed_file.patch]: The mentioned > ENOSPACE spiral-of-death-bug symptoms are like that: > > strace: > copy_file_range(8, [697671680], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697679872], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697688064], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697696256], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697704448], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697712640], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697720832], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697729024], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697737216], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697745408], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697753600], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697761792], 9, NULL, 8192, 0) = 8192 > copy_file_range(8, [697769984], 9, NULL, 8192, 0) = 8192 > > Notice that dest_off_t (poutoff) is NULL. > > (gdb) where > #0 0x00007f2cd56f6733 in copy_file_range (infd=8, > pinoff=pinoff@entry=0x7f2cd53f54e8, outfd=outfd@entry=9, > poutoff=poutoff@entry=0x0, > length=length@entry=8192, flags=flags@entry=0) at > ../sysdeps/unix/sysv/linux/copy_file_range.c:28 > #1 0x00005555ecd077f4 in write_reconstructed_file > (copy_mode=COPY_MODE_COPY_FILE_RANGE, dry_run=false, debug=true, > checksum_ctx=0x7ffc4cdb7700, > offsetmap=<optimized out>, sourcemap=0x7f2cd54f6010, > block_length=<optimized out>, output_filename=0x7ffc4cdba910 > "outtest/base/5/16427.29", > input_filename=0x7ffc4cdba510 > "incr.2/base/5/INCREMENTAL.16427.29") at reconstruct.c:648 > #2 reconstruct_from_incremental_file > (input_filename=input_filename@entry=0x7ffc4cdba510 > "incr.2/base/5/INCREMENTAL.16427.29", > output_filename=output_filename@entry=0x7ffc4cdba910 > "outtest/base/5/16427.29", > relative_path=relative_path@entry=0x7ffc4cdbc670 "base/5", > bare_file_name=bare_file_name@entry=0x5555ee2056ef "16427.29", > n_prior_backups=n_prior_backups@entry=2, > 
prior_backup_dirs=prior_backup_dirs@entry=0x7ffc4cdbf248, > manifests=0x5555ee137a10, manifest_path=0x7ffc4cdbad10 > "base/5/16427.29", > checksum_type=CHECKSUM_TYPE_NONE, checksum_length=0x7ffc4cdb9864, > checksum_payload=0x7ffc4cdb9868, debug=true, dry_run=false, > copy_method=COPY_MODE_COPY_FILE_RANGE) at reconstruct.c:327 > > .. it's a spiral of death till ENOSPC. Reverting the > v20240323-2-0002-write_reconstructed_file.patch helps. The problem > lies in that do-wb=-inifity-loop (?) along with NULL for destination > off_t. This seem to solves that thingy(?): > > - do > - { > - wb = copy_file_range(s->fd, > &offsetmap[i], wfd, NULL, BLCKSZ, 0); > + //do > + //{ > + wb = copy_file_range(s->fd, > &offsetmap[i], wfd, &offsetmap[i], BLCKSZ, 0); > if (wb < 0) > pg_fatal("error while copying > file range from \"%s\" to \"%s\": %m", > > input_filename, output_filename); > - } while (wb > 0); > + //} while (wb > 0); > #else > > ...so that way I've got it down to 5mins. > Yeah, that retry logic is wrong. I ended up copying the check from the "regular" copy branch, which simply bails out if copy_file_range returns anything but the expected 8192. I wonder if this should deal with partial writes, though. I mean, it's allowed copy_file_range() copies only some of the bytes - I don't know how often / in what situations that happens, though ... And if we want to handle that for copy_file_range(), pwrite() needs the same treatment. > 3. .. but onn startup I've got this after trying psql login: invalid > page in block 0 of relation base/5/1259 . I've again reverted the > v20240323-2-0002 to see if that helped for next-round of > pg_combinebackup --manifest-checksums=NONE --copy-file-range and after > 32mins of waiting it did help indeed: I was able to login and select > counts worked and matched properly the data. I've reapplied the > v20240323-2-0002 (with my fix to prevent that endless loop) and the > issue was again(!) there. 
Probably it's related to the destination > offset. I couldn't find more time to look on it today and the setup > was big 100GB on slow device, so just letting You know as fast as > possible. > Can you see if you can still reproduce this with the attached version? > 4. More efficiency is on the table option (optional patch node ; just > for completeness; I dont think we have time for that? ): even if > v20240323-2-0002 would work, the problem is that it would be sending > syscall for every 8kB. We seem to be performing lots of per-8KB > syscalls which hinder performance (both in copy_file_range and in > normal copy): > > pread64(8, ""..., 8192, 369115136) = 8192 // 369115136 + 8192 = > 369123328 (matches next pread offset) > write(9, ""..., 8192) = 8192 > pread64(8, ""..., 8192, 369123328) = 8192 // 369123328 + 8192 = 369131520 > write(9, ""..., 8192) = 8192 > pread64(8, ""..., 8192, 369131520) = 8192 // and so on > write(9, ""..., 8192) = 8192 > > Apparently there's no merging of adjacent IO/s, so pg_combinebackup > wastes lots of time on issuing instead small syscalls but it could > let's say do single pread/write (or even copy_file_range()). I think > it was not evident in my earlier testing (200GB; 39min vs ~40s) as I > had much smaller modifications in my incremental (think of 99% of > static data). > Yes, I've been thinking about exactly this optimization, but I think we're way past proposing this for PG17. The changes that would require in reconstruct_from_incremental_file are way too significant. Has to wait for PG18 ;-) I do think there's more on the table, as mentioned by Thomas a couple days ago - maybe we shouldn't approach clone/copy_file_range merely as an optimization to save time, it might be entirely reasonable to do this simply to allow the filesystem to do CoW magic and save space (even if we need to read the data and recalculate the checksum, which now disables these copy methods). > 5. 
I think we should change the subject with new patch revision, so > that such functionality for incremental backups is not buried down in > the pg_upgrade thread ;) > OK. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
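The partial-write handling wondered about above (for both pwrite and copy_file_range) could look something like the following sketch. It is shown with pread/pwrite, which is portable enough to demonstrate; a copy_file_range() variant would have the same retry shape, with the kernel advancing the input offset and a short return simply meaning "call again for the remainder":

```c
#include <errno.h>
#include <stdbool.h>
#include <unistd.h>
#include <sys/types.h>

/*
 * Copy "len" bytes from src_fd at src_off to dst_fd at dst_off,
 * retrying on short reads and short writes instead of treating them
 * as errors.  Returns false on I/O error or unexpected EOF.
 */
static bool
copy_range_retry(int src_fd, off_t src_off, int dst_fd, off_t dst_off,
				 size_t len)
{
	char		buf[8192];

	while (len > 0)
	{
		size_t		chunk = len < sizeof(buf) ? len : sizeof(buf);
		ssize_t		nread = pread(src_fd, buf, chunk, src_off);

		if (nread <= 0)
			return false;		/* error, or EOF before "len" bytes */

		/* a short pwrite is not an error: write out the rest too */
		ssize_t		off = 0;

		while (off < nread)
		{
			ssize_t		nwritten = pwrite(dst_fd, buf + off,
										  nread - off, dst_off + off);

			if (nwritten < 0)
			{
				if (errno == EINTR)
					continue;
				return false;
			}
			off += nwritten;
		}
		src_off += nread;
		dst_off += nread;
		len -= nread;
	}
	return true;
}
```

Whether short copy_file_range() returns actually occur often in practice is the open question from the discussion; the point of the sketch is only that bailing out on anything other than a full BLCKSZ transfer is stricter than POSIX-style I/O requires.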
Attachments
On Tue, Mar 26, 2024 at 7:03 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: [..] > > That's really strange. Hi Tomas, but it looks like it's fixed now :) > > --manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 27m23.887s > > --manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and > > loop-fix: 5m1.986s but it creates corruption as it stands > > > > Thanks. I plan to do more similar tests, once my machines get done with > some other stuff. Please do so as I do not trust my fingers :-) > > Issues: > > > > 1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327 > > compains about win32/mingw: > > [..] > > > > Yup, missing break. Now it's https://cirrus-ci.com/task/4997185324974080?logs=headers_headerscheck#L10 , reproducible with "make -s headerscheck EXTRAFLAGS='-fmax-errors=10'": /tmp/postgres/src/bin/pg_combinebackup/reconstruct.h:34:91: error: unknown type name ‘CopyMode’ / CopyMode copy_mode); to me it looks like reconstruct.h needs to include definition of CopyMode which is in "#include "copy_file.h" > > 2. I do not know what's the consensus between --clone and > > --copy-file-range , but if we have #ifdef FICLONE clone_works() #elif > > HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also > > apply the same logic to the --help so that --clone is not visible > > there (for consistency?). Also the "error: file cloning not supported > > on this platform " is technically incorrect, Linux does support > > ioctl(FICLONE) and copy_file_range(), but we are just not choosing one > > over another (so technically it is "available"). Nitpicking I know. > > > > That's a good question, I'm not sure. But whatever we do, we should do > the same thing in pg_upgrade. Maybe there's some sort of precedent? Sigh, you are right... It's consistent hell. > > 3. [v20240323-2-0002-write_reconstructed_file.patch]: The mentioned > > ENOSPACE spiral-of-death-bug symptoms are like that: [..] 
> > Yeah, that retry logic is wrong. I ended up copying the check from the > "regular" copy branch, which simply bails out if copy_file_range returns > anything but the expected 8192. > > I wonder if this should deal with partial writes, though. I mean, it's > allowed copy_file_range() copies only some of the bytes - I don't know > how often / in what situations that happens, though ... And if we want > to handle that for copy_file_range(), pwrite() needs the same treatment. Maybe that helps? https://github.com/coreutils/coreutils/blob/606f54d157c3d9d558bdbe41da8d108993d86aeb/src/copy.c#L1427 , it's harder than I anticipated (we can ignore the sparse logic though, I think) > > 3. .. but onn startup I've got this after trying psql login: invalid > > page in block 0 of relation base/5/1259 . [..] > > Can you see if you can still reproduce this with the attached version? Looks like it's fixed now and it works great (~3min, multiple times)! BTW: I've tried to also try it over NFSv4 over loopback with XFS as copy_file_range() does server-side optimization probably, but somehow it was so slow there that's it is close to being unusable (~9GB out of 104GB reconstructed after 45mins) - maybe it's due to NFS mount opts, i don't think we should worry too much. I think it's related to missing the below optimization if that matters. I think it's too early to warn users about NFS (I've spent on it just 10 mins), but on the other hand people might complain it's broken... > > Apparently there's no merging of adjacent IO/s, so pg_combinebackup > > wastes lots of time on issuing instead small syscalls but it could > > let's say do single pread/write (or even copy_file_range()). I think > > it was not evident in my earlier testing (200GB; 39min vs ~40s) as I > > had much smaller modifications in my incremental (think of 99% of > > static data). > > > > Yes, I've been thinking about exactly this optimization, but I think > we're way past proposing this for PG17. 
The changes that would require > in reconstruct_from_incremental_file are way too significant. Has to > wait for PG18 ;-) Sure thing! > I do think there's more on the table, as mentioned by Thomas a couple > days ago - maybe we shouldn't approach clone/copy_file_range merely as > an optimization to save time, it might be entirely reasonable to do this > simply to allow the filesystem to do CoW magic and save space (even if > we need to read the data and recalculate the checksum, which now > disables these copy methods). Sure ! I think time will still be a priority though, as pg_combinebackup duration impacts RTO while disk space is relatively cheap. One could argue that reconstructing 50TB will be a challenge though. Now my tests indicate space saving is already happening with 0002 patch - 100GB DB / full backup stats look like that (so we are good I think when using CoW - not so without using CoW) -- or i misunderstood something?: root@jw-test-1:/backups# du -sm /backups/ 214612 /backups/ root@jw-test-1:/backups# du -sm * 106831 full 2823 incr.1 165 incr.2 104794 outtest root@jw-test-1:/backups# df -h . # note this double confirms that just 114GB is used (XFS), great! 
Filesystem Size Used Avail Use% Mounted on /dev/sdb1 500G 114G 387G 23% /backups root@jw-test-1:/backups# # https://github.com/pwaller/sharedextents root@jw-test-1:/backups# ./sharedextents-linux-amd64 full/base/5/16427.68 outtest/base/5/16427.68 1056915456 / 1073741824 bytes (98.43%) # extents reuse Now I was wondering a little bit if the huge XFS extent allocation won't hurt read performance (probably they were created due many independent copy_file_range() calls): root@jw-test-1:/backups# filefrag full/base/5/16427.68 full/base/5/16427.68: 1 extent found root@jw-test-1:/backups# filefrag outtest/base/5/16427.68 outtest/base/5/16427.68: 3979 extents found However in first look on seq reads of such CoW file it's still good (I'm assuming such backup after reconstruction would be copied back to the proper DB server from this backup server): root@jw-test-1:/backups# echo 3 > /proc/sys/vm/drop_caches root@jw-test-1:/backups# time cat outtest/base/5/16427.68 > /dev/null real 0m4.286s root@jw-test-1:/backups# echo 3 > /proc/sys/vm/drop_caches root@jw-test-1:/backups# time cat full/base/5/16427.68 > /dev/null real 0m4.325s Now Thomas wrote there "then it might make sense to do that even if you *also* have to read the data to compute the checksums, I think? " ... sounds like read(), checksum() and still do copy_file_range() instead of pwrite? PG 18 ? :D -J.
On Tue, Mar 26, 2024 at 2:09 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > The patch I shared a couple minutes ago should fix this, effectively > restoring the original debug behavior. I liked the approach with calling > strategy_implementation a bit more, I wonder if it'd be better to go > back to that for the "accelerated" copy methods, somehow. Somehow I don't see this patch? -- Robert Haas EDB: http://www.enterprisedb.com
On 3/28/24 21:45, Robert Haas wrote: > On Tue, Mar 26, 2024 at 2:09 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> The patch I shared a couple minutes ago should fix this, effectively >> restoring the original debug behavior. I liked the approach with calling >> strategy_implementation a bit more, I wonder if it'd be better to go >> back to that for the "accelerated" copy methods, somehow. > > Somehow I don't see this patch? > It's here: https://www.postgresql.org/message-id/90866c27-265a-4adb-89d0-18c8dd22bc19%40enterprisedb.com I did change the subject to reflect that it's no longer about pg_upgrade, maybe that breaks the threading for you somehow? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Mar 26, 2024 at 2:03 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > [ new patches ] Tomas, thanks for pointing me to this email; as you speculated, gmail breaks threading if the subject line changes. The documentation still needs work here: - It refers to --link mode, which is not a thing. - It should talk about the fact that in some cases block-by-block copying will still be needed, possibly mentioning that it specifically happens when the old backup manifest is not available or does not contain checksums or does not contain checksums of the right type, or maybe just being a bit vague. In copy_file.c: - You added an unnecessary blank line to the beginning of a comment block. - You could keep the strategy_implementation variable here. I don't think that's 100% necessary, but now that you've got the code structured this way, there's no compelling reason to remove it. - I don't know what the +/* XXX maybe this should do the check internally, same as the other functions? */ comment is talking about. - Maybe these functions should have header comments. -- Robert Haas EDB: http://www.enterprisedb.com
On 3/29/24 15:23, Robert Haas wrote:
> On Tue, Mar 26, 2024 at 2:03 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> [ new patches ]
>
> Tomas, thanks for pointing me to this email; as you speculated, gmail
> breaks threading if the subject line changes.
>
> The documentation still needs work here:
>
> - It refers to --link mode, which is not a thing.
>
> - It should talk about the fact that in some cases block-by-block
> copying will still be needed, possibly mentioning that it specifically
> happens when the old backup manifest is not available or does not
> contain checksums or does not contain checksums of the right type, or
> maybe just being a bit vague.
>
> In copy_file.c:
>
> - You added an unnecessary blank line to the beginning of a comment block.
>

Thanks, should be all cleaned up now, I think.

> - You could keep the strategy_implementation variable here. I don't
> think that's 100% necessary, but now that you've got the code
> structured this way, there's no compelling reason to remove it.
>

Yeah, I think you're right. The strategy_implementation seemed a bit
weird to me because we now have 4 functions with different signatures.
Most only take src/dst, but copy_file_blocks() also takes a checksum
context. And it seemed better to handle everything the same way, rather
than treating copy_file_blocks as an exception. But it's not that bad,
so 0001 has strategy_implementation again. But I'll get back to this in
a minute.

> - I don't know what the +/* XXX maybe this should do the check
> internally, same as the other functions? */ comment is talking about.
>

I think this is stale. The XXX is about how the various functions
detect/report support. In most of them we have the ifdefs/pg_fatal()
inside the function, but CopyFile() has nothing like that, because the
detection happens earlier. I wasn't sure if maybe we should make all
these functions more alike, but I don't think we should.

> - Maybe these functions should have header comments.
>

Right, added.
I was thinking about the comment [1] from a couple of days ago, where
Thomas suggested that maybe we should try doing the CoW stuff
(clone/copy_file_range) even in cases when we need to read the block,
say to calculate a checksum, or even to reconstruct from incremental
backups.

I wasn't sure how big the benefits of the patches shared so far might
be, so I decided to do some tests. I did a fairly simple thing:

1) initialize a cluster with pgbench scale 5000 (~75GB)

2) create a full backup

3) do a run that updates ~1%, 10% and 20% of the blocks

4) create an incremental backup after each run

5) run pg_combinebackup for each of the increments, with block-by-block
copy and copy_file_range, and measure how long it takes and how much
disk space it consumes

I did this on xfs and btrfs, and it quickly became obvious that there's
very little benefit unless --no-manifest is used. Which makes perfect
sense, because pgbench does uniform updates, so all segments need to be
reconstructed from increments (so copy_file.c is mostly irrelevant), and
write_reconstructed_file only uses copy_file_range() without checksums.

I don't know how common --no-manifest is going to be, but I guess people
will want to keep manifests in at least some backup schemes (e.g. to
rebuild a full backup instead of having to take a full backup regularly).

So I decided to take a stab at Thomas' idea, i.e. reading the data to
calculate checksums, but then using copy_file_range instead of just
writing the data onto disk. This is in 0003, which relaxes the
conditions in 0002 shared a couple days ago. And this helped a lot.

The attached PDF shows results for xfs/btrfs. Charts on the left are
disk space occupied by the reconstructed backup, measured as the
difference in "df" before and after running pg_combinebackup. The
duration of the pg_combinebackup execution is on the right. The first
row is without a manifest (i.e. --no-manifest), the second row is with
a manifest.
The 1%, 10% and 20% groups are for the various increments, updating
different fractions of the database.

The database is ~75GB, so that's what we expect a plain copy to occupy.
If there are some CoW benefits of copy_file_range, allowing the fs to
reuse some of the space, the disk space used should be reduced. And
similarly, there could/should be some improvements in pg_combinebackup
duration.

Each bar is a different copy method and patch:

* copy on master/0001/0002/0003 - we don't expect any difference between
these, it should all perform the same and use the "full" space

* copy_file_range on 0001/0002/0003 - 0001 should perform the same as
copy, because it's only about full-segment copies, and we don't have any
of those here; 0002/0003 should help, depending on --no-manifest

And indeed, this is what we see. 0002/0003 use only a fraction of the
disk space, roughly the same as the updated fraction (which matches the
size of the increment). This is nice.

For duration, the benefits seem to depend on the file system. For btrfs
it actually is faster, as expected. 0002/0003 save maybe 30-50% of the
time, compared to block-by-block copy. On XFS it's not that great - the
copy_file_range is actually slower, by up to about 50%. And this is not
about the extra read - it affects the 0002/no-manifest case too, where
the read is not necessary.

I think this is fine. It's a tradeoff, where on some filesystems you can
save time or space, and on others you can save both. That's a tradeoff
for the users to decide, I think. I'll see how this works on EXT4/ZFS
next ...

But thinking about this a bit more, I realized there's no reason not to
apply the same logic to the copy_file part. I mean, if we need to copy a
file but also calculate a checksum, we can simply do the clone using
clone/copy_file_range, but then also read the file and calculate the
checksum ...
0004 does this, by simply passing the checksum_cxt around, which also
has the nice consequence that all the functions now have the same
signature, which makes the strategy_implementation work for all cases.
I need to do more testing of this, but I like how this looks. Of
course, maybe there's no agreement that this is the right way to
approach this, and we should do the regular block-by-block copy?

There's one more change in 0003 - the checks for whether
clone/copy_file_range are supported by the platform now happen right at
the beginning, when parsing the arguments, so that when a user specifies
one of those options, the error happens right away instead of sometime
later when we happen to hit one of those pg_fatal() places. I think this
is the right place to do these checks, as it makes
write_reconstructed_file much easier to understand (without all the
ifdefs etc.). But there's an argument whether this should fail with
pg_fatal() or just fall back to the default copy method.

BTW I wonder if it makes sense to only allow one of those methods? At
the moment the user can specify both --clone and --copy-file-range, and
which one "wins" depends on the order in which they are specified. Seems
confusing at best. But maybe it'd make sense to allow both, and e.g. use
clone() to copy whole segments and copy_file_range() for other places?

regards

[1]
https://www.postgresql.org/message-id/CA%2BhUKG%2B8KDk%2BpM6vZHWT6XtZzh-sdieUDohcjj0fia6aqK3Oxg%40mail.gmail.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
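The "same signature" point can be pictured as a single function-pointer
type dispatching to every copy method. The sketch below uses invented
names (checksum_ctx, copy_strategy, choose_strategy and the stub bodies
are all hypothetical; the patch's actual declarations differ):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical checksum context; stands in for pg_checksum_context. */
typedef struct checksum_ctx { uint64_t dummy; } checksum_ctx;

/*
 * One signature for all copy methods: plain copy, clone,
 * copy_file_range, and block-by-block copy with checksums.
 * Because they all match, one pointer can select any of them.
 */
typedef void (*copy_strategy) (const char *src, const char *dst,
							   checksum_ctx *checksum);

static void
copy_blocks(const char *src, const char *dst, checksum_ctx *c)
{	/* pread/pwrite loop, feeding the checksum, would go here */
	(void) src; (void) dst; (void) c;
}

static void
copy_by_range(const char *src, const char *dst, checksum_ctx *c)
{	/* copy_file_range loop; checksum computed from a separate read */
	(void) src; (void) dst; (void) c;
}

/* Pick one implementation up front, then call it for every file. */
static copy_strategy
choose_strategy(int use_copy_file_range)
{
	return use_copy_file_range ? copy_by_range : copy_blocks;
}
```

The design payoff is that the per-file loop never branches on the copy
method; it just calls whatever pointer was chosen during option parsing.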
Attachments
On Sun, Mar 31, 2024 at 1:37 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > So I decided to take a stab at Thomas' idea, i.e. reading the data to > ... > I'll see how this works on EXT4/ZFS next ... Wow, very cool! A couple of very quick thoughts/notes: ZFS: the open source version only gained per-file block cloning in 2.2, so if you're on an older release I expect copy_file_range() to work but not really do the magic thing. On the FreeBSD version you also have to turn cloning on with a sysctl because people were worried about bugs in early versions so by default you still get actual copying, not sure if you need something like that on the Linux version... (Obviously ZFS is always completely COW-based, but before the new block cloning stuff it could only share blocks by its own magic deduplication if enabled, or by cloning/snapshotting a whole dataset/mountpoint; there wasn't a way to control it explicitly like this.) Alignment: block sharing on any fs requires it. I haven't re-checked recently but IIRC the incremental file format might have a non-block-sized header? That means that if you copy_file_range() from both the older backup and also the incremental backup, only the former will share blocks, and the latter will silently be done by copying to newly allocated blocks. If that's still true, I'm not sure how hard it would be to tweak the format to inject some padding and to make sure that there isn't any extra header before each block.
+ wb = copy_file_range(s->fd, &offsetmap[i], wfd, NULL, BLCKSZ, 0); Can you collect adjacent blocks in one multi-block call? And then I think the contract is that you need to loop if it returns short.
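Both points - batching adjacent blocks into one call and looping on
short returns - could be handled by a helper along these lines. This is
a sketch, not the patch's code; copy_range_full and its error handling
are invented, and a contiguous run of nblocks would be copied as
copy_range_full(srcfd, &off, dstfd, nblocks * BLCKSZ):

```c
#define _GNU_SOURCE				/* for copy_file_range() on glibc */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

/*
 * Copy len bytes from srcfd at *srcoff to dstfd at its current file
 * offset.  copy_file_range() may copy fewer bytes than requested, so
 * loop until the whole range is done; the kernel advances *srcoff.
 */
static int
copy_range_full(int srcfd, off_t *srcoff, int dstfd, size_t len)
{
	while (len > 0)
	{
		ssize_t		wb = copy_file_range(srcfd, srcoff, dstfd, NULL, len, 0);

		if (wb < 0)
			return -1;			/* caller checks errno (EXDEV, ENOSYS, ...) */
		if (wb == 0)
		{
			errno = EIO;		/* source file shorter than expected */
			return -1;
		}
		len -= (size_t) wb;
	}
	return 0;
}
```

Note the call can still fail outright on kernels/filesystems without
support, so a caller would likely want a read/write fallback path.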
On 3/31/24 03:03, Thomas Munro wrote: > On Sun, Mar 31, 2024 at 1:37 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> So I decided to take a stab at Thomas' idea, i.e. reading the data to >> ... >> I'll see how this works on EXT4/ZFS next ... > > Wow, very cool! A couple of very quick thoughts/notes: > > ZFS: the open source version only gained per-file block cloning in > 2.2, so if you're on an older release I expect copy_file_range() to > work but not really do the magic thing. On the FreeBSD version you > also have to turn cloning on with a sysctl because people were worried > about bugs in early versions so by default you still get actual > copying, not sure if you need something like that on the Linux > version... (Obviously ZFS is always completely COW-based, but before > the new block cloning stuff it could only share blocks by its own > magic deduplication if enabled, or by cloning/snapshotting a whole > dataset/mountpoint; there wasn't a way to control it explicitly like > this.) > I'm on 2.2.2 (on Linux). But there's something wrong, because the pg_combinebackup that took ~150s on xfs/btrfs, takes ~900s on ZFS. I'm not sure it's a ZFS config issue, though, because it's not CPU or I/O bound, and I see this on both machines. And some simple dd tests show the zpool can do 10x the throughput. Could this be due to the file header / pool alignment? > Alignment: block sharing on any fs requires it. I haven't re-checked > recently but IIRC the incremental file format might have a > non-block-sized header? That means that if you copy_file_range() from > both the older backup and also the incremental backup, only the former > will share blocks, and the latter will silently be done by copying to > newly allocated blocks. If that's still true, I'm not sure how hard > it would be to tweak the format to inject some padding and to make > sure that there isn't any extra header before each block. 
I admit I'm not very familiar with the format, but you're probably right
there's a header, and header_length does not seem to consider alignment.
make_incremental_rfile simply does this:

    /* Remember length of header. */
    rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
        sizeof(rf->truncation_block_length) +
        sizeof(BlockNumber) * rf->num_blocks;

and sendFile() does the same thing when creating an incremental
basebackup. I guess it wouldn't be too difficult to make sure to align
this to BLCKSZ or something like this. I wonder if the file format is
documented somewhere ... It'd certainly be nicer to tweak it before v18,
if necessary.

Anyway, is that really a problem? I mean, in my tests the CoW stuff
seemed to work quite fine - at least on XFS/BTRFS. Although, maybe
that's why it took longer on XFS ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
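Aligning such a header comes down to rounding its length up to the next
BLCKSZ boundary and padding the gap with \0 bytes. A sketch of the
rounding arithmetic (the helper name is invented, not from any patch):

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192				/* assumes BLCKSZ is a power of two */

/*
 * Round the variable-length incremental-file header up to the next
 * BLCKSZ boundary, so the block data that follows starts block-aligned.
 */
static inline uint64_t
align_header(uint64_t raw_header_len)
{
	return (raw_header_len + BLCKSZ - 1) & ~((uint64_t) (BLCKSZ - 1));
}
```

The difference between the aligned and raw lengths is the number of \0
padding bytes the writer would emit after the block map.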
On Sun, Mar 31, 2024 at 5:33 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > I'm on 2.2.2 (on Linux). But there's something wrong, because the > pg_combinebackup that took ~150s on xfs/btrfs, takes ~900s on ZFS. > > I'm not sure it's a ZFS config issue, though, because it's not CPU or > I/O bound, and I see this on both machines. And some simple dd tests > show the zpool can do 10x the throughput. Could this be due to the file > header / pool alignment? Could ZFS recordsize > 8kB be making it worse, repeatedly dealing with the same 128kB record as you copy_file_range 16 x 8kB blocks? (Guessing you might be using the default recordsize?) > I admit I'm not very familiar with the format, but you're probably right > there's a header, and header_length does not seem to consider alignment. > make_incremental_rfile simply does this: > > /* Remember length of header. */ > rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) + > sizeof(rf->truncation_block_length) + > sizeof(BlockNumber) * rf->num_blocks; > > and sendFile() does the same thing when creating incremental basebackup. > I guess it wouldn't be too difficult to make sure to align this to > BLCKSZ or something like this. I wonder if the file format is documented > somewhere ... It'd certainly be nicer to tweak before v18, if necessary. > > Anyway, is that really a problem? I mean, in my tests the CoW stuff > seemed to work quite fine - at least on the XFS/BTRFS. Although, maybe > that's why it took longer on XFS ... Yeah I'm not sure, I assume it did more allocating and copying because of that. It doesn't matter and it would be fine if a first version weren't as good as possible, and fine if we tune the format later once we know more, ie leaving improvements on the table. I just wanted to share the observation. I wouldn't be surprised if the block-at-a-time coding makes it slower and maybe makes the on disk data structures worse, but I dunno I'm just guessing. 
It's also interesting but not required to figure out how to tune ZFS well for this purpose right now...
On 3/31/24 06:46, Thomas Munro wrote: > On Sun, Mar 31, 2024 at 5:33 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> I'm on 2.2.2 (on Linux). But there's something wrong, because the >> pg_combinebackup that took ~150s on xfs/btrfs, takes ~900s on ZFS. >> >> I'm not sure it's a ZFS config issue, though, because it's not CPU or >> I/O bound, and I see this on both machines. And some simple dd tests >> show the zpool can do 10x the throughput. Could this be due to the file >> header / pool alignment? > > Could ZFS recordsize > 8kB be making it worse, repeatedly dealing with > the same 128kB record as you copy_file_range 16 x 8kB blocks? > (Guessing you might be using the default recordsize?) > No, I reduced the record size to 8kB. And the pgbench init takes about the same as on other filesystems on this hardware, I think. ~10 minutes for scale 5000. >> I admit I'm not very familiar with the format, but you're probably right >> there's a header, and header_length does not seem to consider alignment. >> make_incremental_rfile simply does this: >> >> /* Remember length of header. */ >> rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) + >> sizeof(rf->truncation_block_length) + >> sizeof(BlockNumber) * rf->num_blocks; >> >> and sendFile() does the same thing when creating incremental basebackup. >> I guess it wouldn't be too difficult to make sure to align this to >> BLCKSZ or something like this. I wonder if the file format is documented >> somewhere ... It'd certainly be nicer to tweak before v18, if necessary. >> >> Anyway, is that really a problem? I mean, in my tests the CoW stuff >> seemed to work quite fine - at least on the XFS/BTRFS. Although, maybe >> that's why it took longer on XFS ... > > Yeah I'm not sure, I assume it did more allocating and copying because > of that. 
It doesn't matter and it would be fine if a first version > weren't as good as possible, and fine if we tune the format later once > we know more, ie leaving improvements on the table. I just wanted to > share the observation. I wouldn't be surprised if the block-at-a-time > coding makes it slower and maybe makes the on disk data structures > worse, but I dunno I'm just guessing. > > It's also interesting but not required to figure out how to tune ZFS > well for this purpose right now... No idea. Any idea if there's some good ZFS statistics to check? -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
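For what it's worth, OpenZFS on Linux does expose prefetch counters via
the ARC kstats, and a few module/dataset settings are relevant to this
workload. A few inspection commands one might try (paths assume ZFS on
Linux; the dataset name "tank/backups" is a placeholder):

```shell
# prefetch hit/miss counters in the ARC stats
grep prefetch /proc/spl/kstat/zfs/arcstats

# whether prefetch is disabled globally (1 = disabled)
cat /sys/module/zfs/parameters/zfs_prefetch_disable

# per-dataset properties that affect this workload
zfs get recordsize,primarycache,compression tank/backups
```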
Hi,

I've been running some benchmarks and experimenting with various stuff,
trying to improve the poor performance on ZFS, and the regression on XFS
when using copy_file_range. And oh boy, did I find interesting stuff ...

Attached is a PDF with results of my benchmark for ZFS/XFS/BTRFS, on my
two machines. I already briefly described what the benchmark does, but
to clarify:

1) generate backups: initialize pgbench scale 5000, do a full backup,
update roughly 1%, 10% and 20% of blocks and do an incremental backup
after each of those steps

2) combine backups: full + 1%, full + 1% + 10%, full + 1% + 10% + 20%

3) measure how long it takes and how much more disk space is used (to
see how well the CoW stuff works)

4) after each pg_combinebackup, run pg_verifybackup, start the cluster
to finish recovery, and run pg_checksums --check (to check the patches
don't produce something broken)

There's a lot of interesting stuff to discuss, some of which was already
mentioned in this thread earlier - in particular, I want to talk about
block alignment, prefetching and processing larger chunks of blocks.

Attached are also all the patches including the ugly WIP parts discussed
later, complete results if you want to do your own analysis, and the
scripts used to generate/restore the backups.

FWIW I'm not claiming the patches are commit-ready (especially the new
WIP parts), but they should be correct and good enough for discussion
(that applies especially to 0007). I think I could get them ready in a
day or two, but I'd like some feedback on my findings, and also if
someone would have objections to getting this in so shortly before the
feature freeze, I'd prefer to know about that.

The patches are numbered the same as in the benchmark results, i.e. 0001
is "1", 0002 is "2" etc. The "0-baseline" option is current master
without any patches.

Now to the findings ...
1) block alignment
------------------

This was mentioned by Thomas a couple days ago, when he pointed out that
the incremental files have a variable-length header (to record which
blocks are stored in the file), followed by the block data, which means
the block data is not aligned to fs blocks. I hadn't realized this, I
just used whatever the reconstruction function received, but Thomas
pointed out this may interfere with CoW, which needs the blocks to be
aligned.

And I think he's right, and my tests confirm this. I did a trivial patch
to align the blocks to an 8K boundary, by forcing the header to be a
multiple of 8K (I think 4K alignment would be enough). See the 0001
patch that does this.

And if I measure the disk space used by pg_combinebackup, and compare
the results with results without the patch ([1] from a couple days
back), I see this:

   pct      not aligned      aligned
   -------------------------------------
    1%         689M            19M
   10%        3172M            22M
   20%       13797M            27M

Yes, those numbers are correct. I didn't believe this at first, but the
backups are valid/verified, checksums are OK, etc. BTRFS has similar
numbers (e.g. a drop from 20GB to 600MB).

If you look at the charts in the PDF, charts for on-disk space are on
the right side. It might seem like copy_file_range/CoW has no impact,
but that's just an illusion - the bars for the last three cases are so
small it's difficult to see them (especially on XFS). While this does
not show the impact of alignment (because all of the cases in these runs
have blocks aligned), it shows how tiny the backups can be made. But
alignment does have significant impact, per the preceding paragraph.

This also affects the prefetching, which I'm going to talk about next.
Having the blocks misaligned (spanning multiple 4K pages) forces the
system to prefetch more pages than necessary. I don't know how big the
impact is, because the prefetch patch is 0002, so I only have results
for prefetching on aligned blocks, but I don't see how it could not have
a cost.
I do think we should just align the blocks properly. The 0001 patch does
that simply by adding a bunch of \0 bytes up to the next 8K boundary.
Yes, this has a cost - if you have tiny files with only one or two
blocks changed, the increment file will be a bit larger. Files without
any blocks don't need alignment/padding, and as the number of blocks
increases, it gets negligible pretty quickly. Also, files use a multiple
of fs blocks anyway, so if we align to 4K blocks it wouldn't actually
need more space at all. And even if it does, it's all \0, so pretty damn
compressible (and I'm sorry, but if you care about tiny amounts of data
added by alignment, but refuse to use compression ...).

I think we absolutely need to align the blocks in the incremental files,
and I think we should do that now. I think 8K would work, but maybe we
should add an alignment parameter to basebackup & manifest?

The reason why I think maybe this should be a basebackup parameter is
the recent discussion about large fs blocks - it seems to be in the
works, so maybe better to be ready and not assume all fs have 4K.

And I think we probably want to do this now, because this affects all
tools dealing with incremental backups - even if someone writes a custom
version of pg_combinebackup, it will have to deal with misaligned data.
Perhaps there might be something like pg_basebackup that "transforms"
the data received from the server (and also the backup manifest), but
that does not seem like a great direction.

Note: Of course, these space savings only exist thanks to sharing blocks
with the input backups, because the blocks in the combined backup point
to one of the other backups. If those old backups are removed, then the
"saved space" disappears, because there's only a single copy.

2) prefetch
-----------

I was very puzzled by the awful performance on ZFS. When every other fs
(EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took
900-1000 seconds on ZFS, no matter what I did.
I tried all the tuning advice I could think of, with almost no effect.

Ultimately I decided that it probably is the "no readahead" behavior
I've observed on ZFS. I assume it's because it doesn't use the page
cache where the regular readahead is detected etc. And there's no
prefetching in pg_combinebackup, so I decided to do an experiment and
added a trivial explicit prefetch when reconstructing the file - every
time we'd read data from a file, we do posix_fadvise for up to 128
blocks ahead (similar to what the bitmap heap scan code does). See 0002.

And tadaaa - the duration dropped from 900-1000 seconds to only about
250-300 seconds, so an improvement of a factor of 3-4x. I think this is
pretty massive.

There are a couple more interesting ZFS details - the prefetching seems
to be necessary even when using copy_file_range(), where we don't need
to read the data (to calculate checksums). This is why the
"manifest=off" chart has the strange group of high bars at the end - the
copy cases are fast because prefetch happens, but if we switch to
copy_file_range() there are no prefetches and it gets slow.

This is a bit bizarre, especially because the manifest=on cases are
still fast, exactly because the pread + prefetching still happens. I'm
sure users would find this puzzling.

Unfortunately, the prefetching is not beneficial for all filesystems.
For XFS it does not seem to make any difference, but on BTRFS it seems
to cause a regression.

I think this means we may need a "--prefetch" option, that'd force
prefetching, probably both before pread and copy_file_range. Otherwise
people on ZFS are doomed and will have poor performance.

3) bulk operations
------------------

Another thing suggested by Thomas last week was that maybe we should try
detecting longer runs of blocks coming from the same file, and operate
on them as a single chunk of data. If you see e.g.
32 blocks, instead of doing read/write or copy_file_range for each of
them, we could simply do one call for all those blocks at once. I think
such runs are pretty likely, especially for small incremental backups
where most of the blocks will come from the full backup.

And I was suspecting the XFS regression (where copy_file_range was up to
30-50% slower in some cases, see [1]) is related to this, because the
perf profiles had stuff like this:

  97.28%  2.10%  pg_combinebacku  [kernel.vmlinux]  [k]
          |
          |--95.18%--entry_SYSCALL_64
          |          |
          |           --94.99%--do_syscall_64
          |                      |
          |                      |--74.13%--__do_sys_copy_file_range
          |                      |          |
          |                      |           --73.72%--vfs_copy_file_range
          |                      |                      |
          |                      |                       --73.14%--xfs_file_remap_range
          |                      |                                  |
          |                      |                                  |--70.65%--xfs_reflink_remap_blocks
          |                      |                                  |          |
          |                      |                                  |           --69.86%--xfs_reflink_remap_extent

So I took a stab at this in 0007, which detects runs of blocks coming
from the same source file (limited to 128 blocks, i.e. 1MB). I only did
this for the copy_file_range() calls in 0007, and the results for XFS
look like this (complete results are in the PDF):

          old (block-by-block)      new (batches)
   ------------------------------------------------------
    1%           150s                     4s
   10%        150-200s                   46s
   20%        150-200s                   65s

Yes, once again, those results are real, the backups are valid etc. So
not only does it take much less space (thanks to block alignment), it
also takes much less time (thanks to bulk operations).

The cases with "manifest=on" improve too, but not nearly this much. I
believe this is simply because the read/write still happens block by
block. But it shouldn't be difficult to do that in a bulk manner too (we
already have the range detected, but I was lazy).

[1]
https://www.postgresql.org/message-id/0e27835d-dab5-49cd-a3ea-52cf6d9ef59e%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
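The run detection can be sketched roughly like this: scan a per-block
source map and extend a run while consecutive blocks come from the same
file at adjacent offsets, capped at 128 blocks. All names here are
invented for illustration (the actual 0007 patch works on its rfile
structures, not a block_src array):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ  8192
#define MAX_RUN 128				/* cap batches at 1MB, as in the WIP patch */

/* One entry per output block: which source file, at what offset. */
typedef struct
{
	int			source;			/* index of the backup the block comes from */
	uint64_t	offset;			/* byte offset within that source file */
} block_src;

/*
 * Return the length of the contiguous run starting at block i, i.e. how
 * many consecutive blocks come from the same source at adjacent offsets.
 * The caller can then issue one copy (or one read/write) of
 * run_length(...) * BLCKSZ bytes instead of per-block calls.
 */
static size_t
run_length(const block_src *map, size_t nblocks, size_t i)
{
	size_t		len = 1;

	while (len < MAX_RUN && i + len < nblocks &&
		   map[i + len].source == map[i].source &&
		   map[i + len].offset == map[i].offset + len * BLCKSZ)
		len++;
	return len;
}
```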
Attachments:
- v20240401-0001-WIP-block-alignment.patch
- v20240401-0002-WIP-prefetch-blocks-when-reconstructing-fi.patch
- v20240401-0003-use-clone-copy_file_range-to-copy-whole-fi.patch
- v20240401-0004-use-copy_file_range-in-write_reconstructed.patch
- v20240401-0005-use-copy_file_range-with-checksums.patch
- v20240401-0006-allow-cloning-with-checksum-calculation.patch
- v20240401-0007-WIP-copy-larger-chunks-from-the-same-file.patch
- xeon-nvme-xfs.csv
- i5-ssd-zfs.csv
- xeon-nvme-btrfs.csv
- benchmark-results.pdf
- generate-backups.sh
- restore-backups.sh
On Tue, Apr 2, 2024 at 8:43 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> And I think he's right, and my tests confirm this. I did a trivial patch
> to align the blocks to 8K boundary, by forcing the header to be a
> multiple of 8K (I think 4K alignment would be enough). See the 0001
> patch that does this.
>
> And if I measure the disk space used by pg_combinebackup, and compare
> the results with results without the patch ([1] from a couple days
> back), I see this:
>
>    pct      not aligned      aligned
>    -------------------------------------
>     1%         689M            19M
>    10%        3172M            22M
>    20%       13797M            27M
>
> Yes, those numbers are correct. I didn't believe this at first, but the
> backups are valid/verified, checksums are OK, etc. BTRFS has similar
> numbers (e.g. drop from 20GB to 600MB).

Fantastic.

> I think we absolutely need to align the blocks in the incremental files,
> and I think we should do that now. I think 8K would work, but maybe we
> should add alignment parameter to basebackup & manifest?
>
> The reason why I think maybe this should be a basebackup parameter is
> the recent discussion about large fs blocks - it seems to be in the
> works, so maybe better to be ready and not assume all fs have 4K.
>
> And I think we probably want to do this now, because this affects all
> tools dealing with incremental backups - even if someone writes a custom
> version of pg_combinebackup, it will have to deal with misaligned data.
> Perhaps there might be something like pg_basebackup that "transforms"
> the data received from the server (and also the backup manifest), but
> that does not seem like a great direction.

+1, and I think BLCKSZ is the right choice.

> I was very puzzled by the awful performance on ZFS. When every other fs
> (EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took
> 900-1000 seconds on ZFS, no matter what I did. I tried all the tuning
> advice I could think of, with almost no effect.
> > Ultimately I decided that it probably is the "no readahead" behavior > I've observed on ZFS. I assume it's because it doesn't use the page > cache where the regular readahead is detected etc. And there's no > prefetching in pg_combinebackup, so I decided to an experiment and added > a trivial explicit prefetch when reconstructing the file - every time > we'd read data from a file, we do posix_fadvise for up to 128 blocks > ahead (similar to what bitmap heap scan code does). See 0002. > > And tadaaa - the duration dropped from 900-1000 seconds to only about > 250-300 seconds, so an improvement of a factor of 3-4x. I think this is > pretty massive. Interesting. ZFS certainly has its own prefetching heuristics with lots of logic and settings, but it could be that it's using strict-next-block detection of access pattern (ie what I called effective_io_readahead_window=0 in the streaming I/O thread) instead of a window (ie like the Linux block device level read ahead where, AFAIK, if you access anything in that sliding window it is triggered), and perhaps your test has a lot of non-contiguous but close-enough blocks? (Also reminds me of the similar discussion on the BHS thread about distinguishing sequential access from mostly-sequential-but-with-lots-of-holes-like-Swiss-cheese, and the fine line between them.) You could double-check this and related settings (for example I think it might disable itself automatically if you're on a VM with small RAM size): https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-prefetch-disable > There's a couple more interesting ZFS details - the prefetching seems to > be necessary even when using copy_file_range() and don't need to read > the data (to calculate checksums). This is why the "manifest=off" chart > has the strange group of high bars at the end - the copy cases are fast > because prefetch happens, but if we switch to copy_file_range() there > are no prefetches and it gets slow. 
Hmm, at a guess, it might be due to prefetching the dnode (root object for a file) and block pointers, ie the structure but not the data itself. > This is a bit bizarre, especially because the manifest=on cases are > still fast, exactly because the pread + prefetching still happens. I'm > sure users would find this puzzling. > > Unfortunately, the prefetching is not beneficial for all filesystems. > For XFS it does not seem to make any difference, but on BTRFS it seems > to cause a regression. > > I think this means we may need a "--prefetch" option, that'd force > prefetching, probably both before pread and copy_file_range. Otherwise > people on ZFS are doomed and will have poor performance. Seems reasonable if you can't fix it by tuning ZFS. (Might also be an interesting research topic for a potential ZFS patch: prefetch_swiss_cheese_window_size. I will not be nerd-sniped into reading the relevant source today, but I'll figure it out soonish...) > So I took a stab at this in 0007, which detects runs of blocks coming > from the same source file (limited to 128 blocks, i.e. 1MB). I only did > this for the copy_file_range() calls in 0007, and the results for XFS > look like this (complete results are in the PDF): > > old (block-by-block) new (batches) > ------------------------------------------------------ > 1% 150s 4s > 10% 150-200s 46s > 20% 150-200s 65s > > Yes, once again, those results are real, the backups are valid etc. So > not only it takes much less space (thanks to block alignment), it also > takes much less time (thanks to bulk operations). Again, fantastic.
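For readers following along, the alignment fix discussed above boils down to padding the incremental file's header so the first block lands on a BLCKSZ boundary. A minimal sketch of the rounding arithmetic (the helper name is mine, not the patch's; BLCKSZ assumed to be the default 8192):

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192             /* default PostgreSQL block size */

/* Round the incremental-file header length up to the next BLCKSZ
 * boundary, so that every block of data in the file is block-aligned
 * and a CoW filesystem can clone/share whole blocks.
 * (Hypothetical helper, for illustration only.) */
static inline uint64_t
align_to_blcksz(uint64_t len)
{
    return (len + BLCKSZ - 1) & ~((uint64_t) (BLCKSZ - 1));
}
```

Since BLCKSZ is a multiple of 4K, aligning to BLCKSZ also gives the 4K alignment that would already be enough for most filesystems.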
On 4/1/24 23:45, Thomas Munro wrote: > ... >> >> I was very puzzled by the awful performance on ZFS. When every other fs >> (EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took >> 900-1000 seconds on ZFS, no matter what I did. I tried all the tuning >> advice I could think of, with almost no effect. >> >> Ultimately I decided that it probably is the "no readahead" behavior >> I've observed on ZFS. I assume it's because it doesn't use the page >> cache where the regular readahead is detected etc. And there's no >> prefetching in pg_combinebackup, so I decided to an experiment and added >> a trivial explicit prefetch when reconstructing the file - every time >> we'd read data from a file, we do posix_fadvise for up to 128 blocks >> ahead (similar to what bitmap heap scan code does). See 0002. >> >> And tadaaa - the duration dropped from 900-1000 seconds to only about >> 250-300 seconds, so an improvement of a factor of 3-4x. I think this is >> pretty massive. > > Interesting. ZFS certainly has its own prefetching heuristics with > lots of logic and settings, but it could be that it's using > strict-next-block detection of access pattern (ie what I called > effective_io_readahead_window=0 in the streaming I/O thread) instead > of a window (ie like the Linux block device level read ahead where, > AFAIK, if you access anything in that sliding window it is triggered), > and perhaps your test has a lot of non-contiguous but close-enough > blocks? (Also reminds me of the similar discussion on the BHS thread > about distinguishing sequential access from > mostly-sequential-but-with-lots-of-holes-like-Swiss-cheese, and the > fine line between them.) > I don't think the files have a lot of non-contiguous but close-enough blocks (it's rather that we'd skip blocks that need to come from a later incremental file). The backups are generated to have a certain fraction of modified blocks. 
For example, the smallest backup has 1% modified blocks, meaning 99% of blocks come from the base backup and 1% from the increment. And indeed, the whole database is ~75GB and the backup is ~740MB. Which means that on average there will be runs of 99 blocks from the base backup, then a skip of 1 block (coming from the increment), and then again 99-1-99-1. So it's very sequential, almost no holes, and the increment is 100% sequential. And it still does not seem to prefetch anything. > You could double-check this and related settings (for example I think > it might disable itself automatically if you're on a VM with small RAM > size): > > https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-prefetch-disable > I haven't touched that parameter at all, and it's "enabled" by default: # cat /sys/module/zfs/parameters/zfs_prefetch_disable 0 While trying to make the built-in prefetch work I reviewed the other parameters with the "prefetch" tag, without success. And I haven't seen any advice on how to make it work ... >> There's a couple more interesting ZFS details - the prefetching seems to >> be necessary even when using copy_file_range() and don't need to read >> the data (to calculate checksums). This is why the "manifest=off" chart >> has the strange group of high bars at the end - the copy cases are fast >> because prefetch happens, but if we switch to copy_file_range() there >> are no prefetches and it gets slow. > > Hmm, at a guess, it might be due to prefetching the dnode (root object > for a file) and block pointers, ie the structure but not the data > itself. > Yeah, that's possible. But the effects are the same - it doesn't matter what exactly is not prefetched. But perhaps we could prefetch just a tiny part of the record, enough to fetch the dnode+pointers, not the whole record. Might save some space in ARC, perhaps?
>> This is a bit bizarre, especially because the manifest=on cases are >> still fast, exactly because the pread + prefetching still happens. I'm >> sure users would find this puzzling. >> >> Unfortunately, the prefetching is not beneficial for all filesystems. >> For XFS it does not seem to make any difference, but on BTRFS it seems >> to cause a regression. >> >> I think this means we may need a "--prefetch" option, that'd force >> prefetching, probably both before pread and copy_file_range. Otherwise >> people on ZFS are doomed and will have poor performance. > > Seems reasonable if you can't fix it by tuning ZFS. (Might also be an > interesting research topic for a potential ZFS patch: > prefetch_swiss_cheese_window_size. I will not be nerd-sniped into > reading the relevant source today, but I'll figure it out soonish...) > It's entirely possible I'm just too stupid and it works just fine for everyone else. But maybe not, and I'd say an implementation that is this difficult to configure is almost as if it didn't exist at all. The linux read-ahead works by default pretty great. So I don't see how to make this work without explicit prefetch ... Of course, we could also do no prefetch and tell users it's up to ZFS to make this work, but I don't think it does them any service. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Hi, > > I've been running some benchmarks and experimenting with various stuff, > trying to improve the poor performance on ZFS, and the regression on XFS > when using copy_file_range. And oh boy, did I find interesting stuff ... [..] Congratulations on great results! > 4) after each pg_combinebackup run to pg_verifybackup, start the cluster > to finish recovery, run pg_checksums --check (to check the patches don't > produce something broken) I've performed some follow-up small testing on all patches mentioned here (1..7), with the earlier developed nano-incremental-backup-tests that helped detect some issues for Robert earlier during original development. They all went fine in both cases: - no special options when using pg_combinebackup - using pg_combinebackup --copy-file-range --manifest-checksums=NONE Those were: test_across_wallevelminimal.sh test_full_pri__incr_stby__restore_on_pri.sh test_full_pri__incr_stby__restore_on_stby.sh test_full_stby__incr_stby__restore_on_pri.sh test_full_stby__incr_stby__restore_on_stby.sh test_incr_after_timelineincrease.sh test_incr_on_standby_after_promote.sh test_many_incrementals_dbcreate_duplicateOID.sh test_many_incrementals_dbcreate_filecopy_NOINCR.sh test_many_incrementals_dbcreate_filecopy.sh test_many_incrementals_dbcreate.sh test_many_incrementals.sh test_multixact.sh test_pending_2pc.sh test_reindex_and_vacuum_full.sh test_repro_assert_RP.sh test_repro_assert.sh test_standby_incr_just_backup.sh test_stuck_walsum.sh test_truncaterollback.sh test_unlogged_table.sh > Now to the findings .... > > > 1) block alignment [..] > And I think we probably want to do this now, because this affects all > tools dealing with incremental backups - even if someone writes a custom > version of pg_combinebackup, it will have to deal with misaligned data. 
> Perhaps there might be something like pg_basebackup that "transforms" > the data received from the server (and also the backup manifest), but > that does not seem like a great direction. If anything is on the table, then I think in the far future pg_refresh_standby_using_incremental_backup_from_primary would be the only other tool using the format? > 2) prefetch > ----------- [..] > I think this means we may need a "--prefetch" option, that'd force > prefetching, probably both before pread and copy_file_range. Otherwise > people on ZFS are doomed and will have poor performance. Right, we could optionally cover in the docs later on various options to get the performance (on XFS use $this, but without $that and so on). It's kind of madness dealing with all those performance variations. Another idea: remove that 128 posix_fadvise() hardcode in 0002 and add a getopt variant like: --prefetch[=HOWMANY], with 128 as the default? -J.
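Jakub's --prefetch[=HOWMANY] idea maps naturally onto getopt_long's optional_argument, where optarg is NULL when the option is given without a value. A hypothetical sketch of the argument handling (not the actual pg_combinebackup code):

```c
#include <assert.h>
#include <stdlib.h>

#define PREFETCH_DEFAULT 128    /* blocks, matching the 0002 hardcode */

/* With getopt_long and an option declared as
 *   {"prefetch", optional_argument, NULL, ...}
 * optarg is NULL when the user writes plain --prefetch, and points at
 * the value for --prefetch=64. Fall back to the default in the former
 * case. (Sketch only; error checking on the number omitted.) */
static int
parse_prefetch_arg(const char *arg)
{
    if (arg == NULL)
        return PREFETCH_DEFAULT;
    return atoi(arg);
}
```

Note that optional_argument only accepts the `--prefetch=64` spelling, not `--prefetch 64`, which is a known getopt_long quirk worth documenting if this interface is adopted.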
Hi, Here's a much more polished and cleaned-up version of the patches, fixing all the issues I've been aware of, and with various parts merged into much more cohesive parts (instead of keeping them separate to make the changes/evolution more obvious). I decided to reorder the changes like this: 1) block alignment - As I said earlier, I think we should definitely do this, even if only to make future improvements possible. After chatting about this with Robert off-list a bit, he pointed out I actually forgot to not align the headers for files without any blocks, so this version fixes that. 2) add clone/copy_file_range for the case that copies whole files. This is pretty simple, but it also adds the --clone/copy-file-range options, and similar infrastructure. The one slightly annoying bit is that we now have the ifdef stuff in two places - when parsing the options, and then in the copy_file_XXX methods, and the pg_fatal() calls should not be reachable in practice. But that doesn't seem harmful, and might be a useful protection against someone calling a function that does nothing. This also merges the original two parts, where the first one only did this cloning/CoW stuff when a checksum did not need to be calculated, and the second extended it to those cases too (by also reading the data, but still doing the copy the old way). I think this is the right way - if that's not desirable, it's easy to either add --no-manifest or not use the CoW options. Or just not change the checksum type. There's no other way. 3) add copy_file_range to write_reconstructed_file, using roughly the same logic/reasoning as (2). If --copy-file-range is specified and a checksum should be calculated, the data is read for the checksum, but the copy is done using copy_file_range.
I did rework the write_reconstructed_file() flow a bit, because tracking what exactly needs to be read/written in each of the cases (which copy method, zeroed block, checksum calculated) made the flow really difficult to follow. Instead I introduced functions to read/write a block, and call them from different places. I think this is much better, and it also makes the following part dealing with batches of blocks a much easier / smaller change. 4) prefetching - This is mostly unrelated to the CoW stuff, but it has tremendous benefits, especially for ZFS. I haven't been able to tune ZFS to get decent performance, and ISTM it'd be rather unusable for backup purposes without this. 5) batching in write_reconstructed_file - This benefits especially the copy_file_range case, where I've seen it yield massive speedups (see the message from Monday for better data). 6) batching for prefetch - Similar logic to (5), but for fadvise. This is the only part where I'm not entirely sure whether it actually helps or not. Needs some more analysis, but I'm including it for completeness. I do think the parts (1)-(4) are in pretty good shape, good enough to make them committable in a day or two. I see it mostly as a matter of testing and comment improvements rather than code changes. (5) is in pretty good shape too, but I'd like to maybe simplify and refine the write_reconstructed_file changes a bit more. I don't think it's broken, but it feels a bit too cumbersome. Not sure about (6) yet. I changed how I think about this a bit - I don't really see the CoW copy methods as necessarily faster than the regular copy (even though it can be, with (5)). I think the main benefits are the space savings, enabled by patches (1)-(3). If (4) and (5) get in, that's a bonus, but even without that I don't think the performance is an issue - everything has a cost.
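The batching in (5) boils down to scanning the reconstruction map for runs of blocks that come from the same source file and issuing one copy_file_range() per run instead of one per block, capped at 128 blocks (1MB). A simplified sketch of the run detection - a real implementation would also have to check that the source offsets are consecutive, and the names here are mine, not the patch's:

```c
#include <assert.h>

#define MAX_BATCH_BLOCKS 128    /* 1MB with 8kB blocks */

/* Given source_map[i] = index of the source file that block i comes
 * from, return the number of blocks starting at 'start' that could be
 * copied in a single copy_file_range() call: same source file,
 * contiguous in the output, capped at 1MB. (Simplification: assumes
 * blocks from the same source are also consecutive within it.) */
static int
batch_length(const int *source_map, int nblocks, int start)
{
    int len = 1;

    while (start + len < nblocks &&
           len < MAX_BATCH_BLOCKS &&
           source_map[start + len] == source_map[start])
        len++;
    return len;
}
```

With a 1% increment the typical pattern is 99 base-backup blocks, 1 increment block, so most calls would cover runs of 99 blocks instead of 99 separate 8kB copies.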
On 4/3/24 15:39, Jakub Wartak wrote: > On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> Hi, >> >> I've been running some benchmarks and experimenting with various stuff, >> trying to improve the poor performance on ZFS, and the regression on XFS >> when using copy_file_range. And oh boy, did I find interesting stuff ... > > [..] > > Congratulations on great results! > >> 4) after each pg_combinebackup run to pg_verifybackup, start the cluster >> to finish recovery, run pg_checksums --check (to check the patches don't >> produce something broken) > > I've performed some follow-up small testing on all patches mentioned > here (1..7), with the earlier developed nano-incremental-backup-tests > that helped detect some issues for Robert earlier during original > development. They all went fine in both cases: > - no special options when using pg_combinebackup > - using pg_combinebackup --copy-file-range --manifest-checksums=NONE > > Those were: > test_across_wallevelminimal.sh > test_full_pri__incr_stby__restore_on_pri.sh > test_full_pri__incr_stby__restore_on_stby.sh > test_full_stby__incr_stby__restore_on_pri.sh > test_full_stby__incr_stby__restore_on_stby.sh > test_incr_after_timelineincrease.sh > test_incr_on_standby_after_promote.sh > test_many_incrementals_dbcreate_duplicateOID.sh > test_many_incrementals_dbcreate_filecopy_NOINCR.sh > test_many_incrementals_dbcreate_filecopy.sh > test_many_incrementals_dbcreate.sh > test_many_incrementals.sh > test_multixact.sh > test_pending_2pc.sh > test_reindex_and_vacuum_full.sh > test_repro_assert_RP.sh > test_repro_assert.sh > test_standby_incr_just_backup.sh > test_stuck_walsum.sh > test_truncaterollback.sh > test_unlogged_table.sh > >> Now to the findings .... >> Thanks. Would be great if you could run this on the attached version of the patches, ideally for each of them independently, so make sure it doesn't get broken+fixed somewhere on the way. >> >> 1) block alignment > > [..] 
> >> And I think we probably want to do this now, because this affects all >> tools dealing with incremental backups - even if someone writes a custom >> version of pg_combinebackup, it will have to deal with misaligned data. >> Perhaps there might be something like pg_basebackup that "transforms" >> the data received from the server (and also the backup manifest), but >> that does not seem like a great direction. > > If anything is on the table, then I think in the far future > pg_refresh_standby_using_incremental_backup_from_primary would be the > only other tool using the format ? > Possibly, but I was thinking more about backup solutions using the same format, but doing the client-side differently. Essentially, something that would still use the server side to generate incremental backups, but replace pg_combinebackup to do this differently (stream the data somewhere else, index it somehow, or whatever). >> 2) prefetch >> ----------- > [..] >> I think this means we may need a "--prefetch" option, that'd force >> prefetching, probably both before pread and copy_file_range. Otherwise >> people on ZFS are doomed and will have poor performance. > > Right, we could optionally cover in the docs later-on various options > to get the performance (on XFS use $this, but without $that and so > on). It's kind of madness dealing with all those performance > variations. > Yeah, something like that. I'm not sure we want to talk too much about individual filesystems in our docs, because those things evolve over time too. And also this depends on how large the increment is. If it only modifies 1% of the blocks, then 99% will come from the full backup, and the sequential prefetch should do OK (well, maybe not on ZFS). But as the incremental backup gets larger / is more random, the prefetch is more and more important. > Another idea: remove that 128 posifx_fadvise() hardcode in 0002 and a > getopt variant like: --prefetch[=HOWMANY] with 128 being as default ? 
> I did think about that, but there's a dependency on the batching. If we're prefetching ~1MB of data, we may need to prefetch up to ~1MB ahead, because otherwise we might read 1MB of which only a tiny part had been prefetched. I was thinking maybe we could skip the sequential parts, but ZFS does need that. So I don't think we can just allow users to set arbitrary values, at least not without also tweaking the batch. Or maybe 1MB batches are too large, and we should use something smaller? I need to think about this a bit more ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
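One way to express the dependency Tomas describes would be to clamp any user-supplied prefetch distance to at least the batch size, so that a full 1MB copy batch is always covered by the prefetch. This is purely a hypothetical sketch, not a settled design:

```c
#include <assert.h>

#define BATCH_BLOCKS 128        /* 1MB batches with 8kB blocks */

/* Never prefetch less than one full batch ahead; otherwise a batched
 * copy_file_range()/pread of BATCH_BLOCKS blocks would touch mostly
 * non-prefetched data. Larger user-requested distances are honored. */
static int
effective_prefetch_distance(int requested)
{
    return requested < BATCH_BLOCKS ? BATCH_BLOCKS : requested;
}
```

An alternative, as suggested downthread, is simply defining the two constants in terms of each other in a shared header so they cannot drift apart.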
Attachments
- v20240403-0001-Properly-align-blocks-in-incremental-backu.patch
- v20240403-0002-Allow-copying-files-using-clone-copy_file_.patch
- v20240403-0003-Use-copy_file_range-in-write_reconstructed.patch
- v20240403-0004-Prefetch-blocks-read-by-write_reconstructe.patch
- v20240403-0005-Try-copying-larger-chunks-of-data-from-the.patch
- v20240403-0006-Try-prefetching-larger-chunks-of-data-from.patch
On Thu, Apr 4, 2024 at 12:56 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Hi, > > Here's a much more polished and cleaned up version of the patches, > fixing all the issues I've been aware of, and with various parts merged > into much more cohesive parts (instead of keeping them separate to make > the changes/evolution more obvious). OK, so three runs of incrementalbackupstests - as stated earlier - also passed with OK for v20240403 (this time even with --enable-casserts) pg_combinebackup flags tested were: 1) --copy-file-range --manifest-checksums=CRC32C 2) --copy-file-range --manifest-checksums=NONE 3) default, no flags (no copy-file-range) > I changed how I think about this a bit - I don't really see the CoW copy > methods as necessary faster than the regular copy (even though it can be > with (5)). I think the main benefits are the space savings, enabled by > patches (1)-(3). If (4) and (5) get it, that's a bonus, but even without > that I don't think the performance is an issue - everything has a cost. I take it differently: incremental backups without a CoW fs would clearly be: - inefficient in terms of RTO (SANs are often a key thing due to the fast ability to "restore" the clone rather than copying the data from somewhere else) - pg_basebackup without that would be unusable without space savings (e.g. imagine daily backups @ 10+TB DWHs) > On 4/3/24 15:39, Jakub Wartak wrote: > > On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: [..] > Thanks. Would be great if you could run this on the attached version of > the patches, ideally for each of them independently, so make sure it > doesn't get broken+fixed somewhere on the way. Those are semi-manual test runs (~30 min? per run), and the above results are for all of them applied at once. So my take is that all of them work together, and each one does individually too.
FWIW, I'm also testing your other offlist incremental backup corruption issue, but that doesn't seem to be related in any way to the copy_file_range() patches here. > >> 2) prefetch > >> ----------- > > [..] > >> I think this means we may need a "--prefetch" option, that'd force > >> prefetching, probably both before pread and copy_file_range. Otherwise > >> people on ZFS are doomed and will have poor performance. > > > > Right, we could optionally cover in the docs later-on various options > > to get the performance (on XFS use $this, but without $that and so > > on). It's kind of madness dealing with all those performance > > variations. > > > > Yeah, something like that. I'm not sure we want to talk too much about > individual filesystems in our docs, because those things evolve over > time too. Sounds like Wiki then. BTW, after a quick review: could we in 0005 have something like a common value (to keep those together via some .h)? #define BATCH_SIZE PREFETCH_TARGET ? -J.
On 4/4/24 12:25, Jakub Wartak wrote: > On Thu, Apr 4, 2024 at 12:56 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> Hi, >> >> Here's a much more polished and cleaned up version of the patches, >> fixing all the issues I've been aware of, and with various parts merged >> into much more cohesive parts (instead of keeping them separate to make >> the changes/evolution more obvious). > > OK, so three runs of incrementalbackupstests - as stated earlier - > also passed with OK for v20240403 (his time even with > --enable-casserts) > > pg_combinebackup flags tested were: > 1) --copy-file-range --manifest-checksums=CRC32C > 2) --copy-file-range --manifest-checksums=NONE > 3) default, no flags (no copy-file-range) > Thanks! >> I changed how I think about this a bit - I don't really see the CoW copy >> methods as necessary faster than the regular copy (even though it can be >> with (5)). I think the main benefits are the space savings, enabled by >> patches (1)-(3). If (4) and (5) get it, that's a bonus, but even without >> that I don't think the performance is an issue - everything has a cost. > > I take i differently: incremental backups without CoW fs would be clearly : > - inefficient in terms of RTO (SANs are often a key thing due to > having fast ability to "restore" the clone rather than copying the > data from somewhere else) > - pg_basebackup without that would be unusuable without space savings > (e.g. imagine daily backups @ 10+TB DWHs) > Right, although this very much depends on the backup scheme. If you only take incremental backups, and then also a full backup once in a while, the CoW stuff probably does not help much. The alignment (the only thing affecting basebackups) may allow deduplication, but that's all I think. If the scheme is more complex, and involves "merging" the increments into the full backup, then this does help a lot. It'd even be possible to cheaply clone instances this way, I think. 
But I'm not sure how often would people do that on the same volume, to benefit from the CoW. >> On 4/3/24 15:39, Jakub Wartak wrote: >>> On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: > [..] >> Thanks. Would be great if you could run this on the attached version of >> the patches, ideally for each of them independently, so make sure it >> doesn't get broken+fixed somewhere on the way. > > Those are semi-manual test runs (~30 min? per run), the above results > are for all of them applied at once. So my take is all of them work > each one does individually too. > Cool, thanks. > FWIW, I'm also testing your other offlist incremental backup > corruption issue, but that doesnt seem to be related in any way to > copy_file_range() patches here. > Yes, that's entirely independent, happens with master too. >>>> 2) prefetch >>>> ----------- >>> [..] >>>> I think this means we may need a "--prefetch" option, that'd force >>>> prefetching, probably both before pread and copy_file_range. Otherwise >>>> people on ZFS are doomed and will have poor performance. >>> >>> Right, we could optionally cover in the docs later-on various options >>> to get the performance (on XFS use $this, but without $that and so >>> on). It's kind of madness dealing with all those performance >>> variations. >>> >> >> Yeah, something like that. I'm not sure we want to talk too much about >> individual filesystems in our docs, because those things evolve over >> time too. > > Sounds like Wiki then. > > BTW, after a quick review: could we in 05 have something like common > value then (to keep those together via some .h?) > > #define BATCH_SIZE PREFETCH_TARGET ? > Yes, that's one of the things I'd like to refine a bit more. Making it more consistent / clearer that these things are interdependent. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, I have pushed the three patches of this series - the one that aligns blocks, and the two adding clone/copy_file_range to pg_combinebackup. The committed versions are pretty much the 2024/04/03 version, with various minor cleanups (e.g. I noticed the docs are still claiming the copy methods work only without checksum calculations, but that's no longer true). I also changed the parameter order to keep the dry_run and debug parameters last, it seems nicer this way. The buildfarm reported two compile-time problems, both of them entirely avoidable (reported by cfbot but I failed to notice that). Should have known better ... Anyway, with these patches committed, pg_combinebackup can use CoW stuff to combine backups cheaply (especially in disk-space terms). The first patch (block alignment) however turned out to be important even for non-CoW filesystems, in some cases. I did a lot of benchmarks with the standard block-by-block copying of data, and on a machine with SSD RAID storage the duration went from ~400 seconds for some runs to only about 150 seconds (with aligned blocks). My explanation is that with the misaligned blocks the RAID often has to access two devices to read a block, and the alignment makes that go away. In the attached PDF with results (duration.pdf), showing the duration of pg_combinebackup on an increment of a particular size (1%, 10% or 20%), this is visible as a green square on the right. Those columns are results relative to a baseline - which for "copy" is master before the block alignment patch, and for "copy_file_range" it's the 3-reconstruct (adding copy_file_range to combining blocks from increments). FWIW the last three columns are a comparison with prefetching enabled. There's a couple interesting observations from this, based on which I'm not going to try to get the remaining patches (batching and prefetching) into v17. It clearly needs more analysis to make the right tradeoff. 
From the table, I think it's clear that: 0) Block alignment has a big impact on RAID storage, even with regular copy. 1) The batching (original patch 0005) either does not help the regular copy, or it actually makes it slower. The PDF is a bit misleading because it seems to suggest the i5 machine is unaffected, while the xeon gets ~30% slower. But that's just an illusion - the comparison is to master, but the alignment patch made i5 about 2x faster. So it's 200% slower when compared to "current master" with the alignment patch. That's not great :-/ And also a bit strange - I would have expected the batching to help the simple copy too. I haven't looked into why this happens, so there's a chance I made some silly mistake, who knows. For the copy_file_range case the batching is usually very beneficial, sometimes reducing the duration to a fraction of the non-batched case. My interpretation is that (unless there's a bug in the patch) we may need two variants of that code - a non-batched one for regular copy, and a batched variant for copy_file_range. 2) The prefetching is not a huge improvement, at least not for these three filesystems (btrfs, ext4, xfs). From the color scale it might seem like it helps, but those values are relative to the baseline, so when the non-prefetching value is 5% and with prefetching 10%, that means the prefetching makes it slower. And that's very often true. This is visible more clearly in prefetching.pdf, comparing the non-prefetching and prefetching results for each patch, not to baseline. That makes it quite clear there's a lot of "red" where prefetching makes it slower. It certainly does help for larger increments (which makes sense, because the modified blocks are distributed randomly, and thus come from random files, making long streaks unlikely). I've imagined the prefetching could be made a bit smarter to ignore the streaks (=sequential patterns), but once again - this only matters with the batching, which we don't have.
And without the batching it looks like a net loss (that's the first column in the prefetching PDF). I did start thinking about prefetching because of ZFS, where it was necessary to get decent performance. And that's still true. But (a) even with the explicit prefetching it's still 2-3x slower than any of these filesystems, so I assume performance-sensitive use cases won't use it. And (b) the prefetching seems necessary in all cases, no matter how large the increment is. Which goes directly against the idea of looking at how random the blocks are and prefetching only the sufficiently random patterns. That doesn't seem like a great thing. 3) There's also the question of disk space usage. The size.pdf shows how the patches affect space needed for the pg_combinebackup result. It does depend a bit on the internal fs cleanup for each run, but it seems the batching makes a difference - clearly copying 1MB blocks instead of 8kB allows lower overhead for some filesystems (e.g. btrfs, where we get from ~1.5GB to a couple MBs). But the space savings are quite negligible compared to just using --copy-file-range option (where we get from 75GB to 1.5GB). I think the batching is interesting mostly because of the substantial duration reduction. I'm also attaching the benchmarking script I used (warning: ugly!), and results for the three filesystems. For ZFS I only have partial results so far, because it's so slow, but in general - without prefetching it's slow (~1000s) with prefetching it's better but still slow (~250s). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On 4/5/24 21:43, Tomas Vondra wrote: > Hi, > > ... > > 2) The prefetching is not a huge improvement, at least not for these > three filesystems (btrfs, ext4, xfs). From the color scale it might seem > like it helps, but those values are relative to the baseline, so when > the non-prefetching value is 5% and with prefetching 10%, that means the > prefetching makes it slower. And that's very often true. > > This is visible more clearly in prefetching.pdf, comparing the > non-prefetching and prefetching results for each patch, not to baseline. > That's makes it quite clear there's a lot of "red" where prefetching > makes it slower. It certainly does help for larger increments (which > makes sense, because the modified blocks are distributed randomly, and > thus come from random files, making long streaks unlikely). > > I've imagined the prefetching could be made a bit smarter to ignore the > streaks (=sequential patterns), but once again - this only matters with > the batching, which we don't have. And without the batching it looks > like a net loss (that's the first column in the prefetching PDF). > > I did start thinking about prefetching because of ZFS, where it was > necessary to get decent performance. And that's still true. But (a) even > with the explicit prefetching it's still 2-3x slower than any of these > filesystems, so I assume performance-sensitive use cases won't use it. > And (b) the prefetching seems necessary in all cases, no matter how > large the increment is. Which goes directly against the idea of looking > at how random the blocks are and prefetching only the sufficiently > random patterns. That doesn't seem like a great thing. > I finally got a more complete ZFS results, and I also decided to get some numbers without the ZFS tuning I did. And boy oh boy ... 
All the tests I did with ZFS were tuned the way I've seen recommended
when using ZFS for PostgreSQL, that is

    zfs set recordsize=8K logbias=throughput compression=none

and this performed quite poorly - pg_combinebackup took 4-8x longer than
with the traditional filesystems (btrfs, xfs, ext4), and the only thing
that improved that measurably was prefetching.

But once I reverted back to the default recordsize of 128kB, the
performance is way better - entirely comparable to ext4/xfs, while
btrfs remains faster with --copy-file-range --no-manifest (by a factor
of 2-3x).

This is quite clearly visible in the attached "current.pdf", which shows
results for the current master (i.e. filtered to the 3-reconstruct patch
adding CoW stuff to write_reconstructed_file).

There are also some differences in disk usage, where ZFS seems to need
more space than xfs/btrfs (as if there was no block sharing), but maybe
that's due to how I measure this using df ...

I also tried a "completely default" ZFS configuration, with all options
left at the default (recordsize=128kB, compression=lz4, and
logbias=latency). That performs about the same, except that the disk
usage is lower thanks to the compression.

Note: Because I'm a hip cool kid, I also ran the tests on bcachefs. The
results are included in the CSV/PDF attachments. In general it's much
slower than xfs/btrfs/ext4, and the disk space is somewhere in between
btrfs and xfs (for the CoW cases). We'll see how this improves as it
matures.

The attachments are tables with the total duration / disk space usage,
and the impact of prefetching. The tables are similar to what I shared
before, except that the color scale is applied to the values directly.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
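For reference, the explicit prefetching being benchmarked here - including the "make it smarter by ignoring sequential streaks" idea from the quoted message - can be sketched in a few lines of Python using `os.posix_fadvise()`. This is an illustrative stand-in under my own assumptions (the real patch is C code inside pg_combinebackup's reconstruction loop, and whether a given filesystem honors the advice - ZFS often does not - is up to the kernel):

```python
import os

BLCKSZ = 8192  # PostgreSQL block size

def prefetch(fd, upcoming_blocks, blcksz=BLCKSZ):
    """Hint the kernel to read ahead blocks we are about to process.
    Blocks that continue a sequential streak are skipped, since ordinary
    kernel readahead already covers those; only blocks that break the
    pattern get an explicit POSIX_FADV_WILLNEED.  Returns the number of
    advisories issued."""
    issued = 0
    prev = None
    for blkno in upcoming_blocks:
        if prev is not None and blkno == prev + 1:
            prev = blkno  # sequential streak: let readahead handle it
            continue
        os.posix_fadvise(fd, blkno * blcksz, blcksz,
                         os.POSIX_FADV_WILLNEED)
        issued += 1
        prev = blkno
    return issued
```

Note that the benchmarks above found this kind of prefetching a net loss on btrfs/ext4/xfs without batching, and unnecessary on ZFS once the default 128kB recordsize is used - which is the basis for the conclusion in the follow-up message.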
On 4/7/24 19:46, Tomas Vondra wrote:
> On 4/5/24 21:43, Tomas Vondra wrote:
>> Hi,
>>
>> ...
>>
>> 2) The prefetching is not a huge improvement, at least not for these
>> three filesystems (btrfs, ext4, xfs). From the color scale it might seem
>> like it helps, but those values are relative to the baseline, so when
>> the non-prefetching value is 5% and with prefetching 10%, that means the
>> prefetching makes it slower. And that's very often true.
>>
>> This is visible more clearly in prefetching.pdf, comparing the
>> non-prefetching and prefetching results for each patch, not to baseline.
>> That makes it quite clear there's a lot of "red" where prefetching
>> makes it slower. It certainly does help for larger increments (which
>> makes sense, because the modified blocks are distributed randomly, and
>> thus come from random files, making long streaks unlikely).
>>
>> I've imagined the prefetching could be made a bit smarter to ignore the
>> streaks (=sequential patterns), but once again - this only matters with
>> the batching, which we don't have. And without the batching it looks
>> like a net loss (that's the first column in the prefetching PDF).
>>
>> I did start thinking about prefetching because of ZFS, where it was
>> necessary to get decent performance. And that's still true. But (a) even
>> with the explicit prefetching it's still 2-3x slower than any of these
>> filesystems, so I assume performance-sensitive use cases won't use it.
>> And (b) the prefetching seems necessary in all cases, no matter how
>> large the increment is. Which goes directly against the idea of looking
>> at how random the blocks are and prefetching only the sufficiently
>> random patterns. That doesn't seem like a great thing.
>>
>
> I finally got more complete ZFS results, and I also decided to get
> some numbers without the ZFS tuning I did. And boy oh boy ...
>
> All the tests I did with ZFS were tuned the way I've seen recommended
> when using ZFS for PostgreSQL, that is
>
>     zfs set recordsize=8K logbias=throughput compression=none
>
> and this performed quite poorly - pg_combinebackup took 4-8x longer than
> with the traditional filesystems (btrfs, xfs, ext4), and the only thing
> that improved that measurably was prefetching.
>
> But once I reverted back to the default recordsize of 128kB, the
> performance is way better - entirely comparable to ext4/xfs, while
> btrfs remains faster with --copy-file-range --no-manifest (by a factor
> of 2-3x).
>

I forgot to say explicitly that I think this confirms the decision not
to push the patch adding the explicit prefetching to pg_combinebackup.
It's not needed/beneficial even for ZFS, when using a suitable
configuration.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company