Обсуждение: Two patches to speed up pg_rewind.

Поиск
Список
Период
Сортировка

Two patches to speed up pg_rewind.

От
Paul Guo
Дата:
While reading pg_rewind code I found two things could speed up pg_rewind.
Attached are the patches.

First one: pg_rewind would fsync the whole pgdata directory on the target by default,
but that is a waste since usually just part of the files/directories on
the target are modified. Other files on the target should have been flushed
since pg_rewind requires a clean shutdown before doing the real work. This
would help the scenario that the target postgres instance includes millions of
files, which has been seen in a real environment.

There are several things that may need further discussions:

1. PG_FLUSH_DATA_WORKS was introduced as "Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data”,
    but now the code guarded by it is just pre_sync_fname() relevant so we might want
    to rename it as HAVE_PRE_SYNC kind of name?

2. Pre_sync_fname() implementation

    The code looks like this:
  #if defined(HAVE_SYNC_FILE_RANGE)
      (void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
  #elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
      (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    I’m a bit suspicious about calling posix_fadvise() with POSIX_FADV_DONTNEED.
    I did not check the Linux Kernel code but according to the man
    page I suspect that this option might cause the kernel tends to evict the related kernel
    pages from the page cache, which might not be something we expect. This is
    not a big issue since sync_file_range() should exist on many widely used Linux.

    Also I’m not sure how much we could benefit from the pre_sync code. Also note if the
    directory has a lot of files or the IO is fast, pre_sync_fname() might slow down
    the process instead. The reasons are: If there are a lot of files it is possible that we need
    to read the already-synced-and-evicted inode from disk (by open()-ing) after rewinding since
   the inode cache in Linux Kernel is limited; also if the IO is faster and kernel do background
   dirty page flush quickly, pre_sync_fname() might just waste cpu cycles.

   A better solution might be launch a separate pthread and do fsync one by one
   when pg_rewind finishes handling one file. pg_basebackup could use the solution also.

   Anyway this is independent of this patch.

Second one is use copy_file_range() for the local rewind case to replace read()+write().
This introduces copy_file_range() check and HAVE_COPY_FILE_RANGE so other
code could use copy_file_range() if needed. copy_file_range() was introduced
In high-version Linux Kernel, in low-version Linux or other Unix-like OS mmap()
might be better than read()+write() but copy_file_range() is more interesting
given that it could skip the data copying in some file systems - this could benefit more
on Linux fs on network-based block storage.

Regards,
Paul
Вложения

Re: Two patches to speed up pg_rewind.

От
Michael Paquier
Дата:
On Wed, Jan 27, 2021 at 09:18:48AM +0000, Paul Guo wrote:
> Second one is use copy_file_range() for the local rewind case to replace read()+write().
> This introduces copy_file_range() check and HAVE_COPY_FILE_RANGE so other
> code could use copy_file_range() if needed. copy_file_range() was introduced
> In high-version Linux Kernel, in low-version Linux or other Unix-like OS mmap()
> might be better than read()+write() but copy_file_range() is more interesting
> given that it could skip the data copying in some file systems - this could benefit more
> on Linux fs on network-based block storage.

Have you done some measurements?
--
Michael

Вложения

Re: Two patches to speed up pg_rewind.

От
Paul Guo
Дата:


On Jan 28, 2021, at 3:31 PM, Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Jan 27, 2021 at 09:18:48AM +0000, Paul Guo wrote:
Second one is use copy_file_range() for the local rewind case to replace read()+write().
This introduces copy_file_range() check and HAVE_COPY_FILE_RANGE so other
code could use copy_file_range() if needed. copy_file_range() was introduced
In high-version Linux Kernel, in low-version Linux or other Unix-like OS mmap()
might be better than read()+write() but copy_file_range() is more interesting
given that it could skip the data copying in some file systems - this could benefit more
on Linux fs on network-based block storage.

Have you done some measurements?

I did not test pg_rewind but for patch 2, I tested copy_fiile_range() vs read()+write()
on XFS in Ubuntu 20.04.1 when working on the patches,

Here is the test time of 1G file (fully populated with random data) copy. The test is a simple C program.

copy_file_range() loop (actually it finished after one call) + fsync()
0m0.048s

For read()+write() loop with read/write buffer size 32K + fsync()
0m5.004s

For patch 1, it skips syncing less files so it surely benefits the performance.

Re: Two patches to speed up pg_rewind.

От
Paul Guo
Дата:
Refactored the code a bit along with fixes. Manually tested them on centos
& Ubuntu (the later has copy_file_range())

For the first patch, actually I have some concerns. My assumption is that
the target pg_data directory should be fsync-ed already. This should be
correct normally but there is one scenario: a cleanly-shutdown database’s
pgdata directory was copied to another directory, in this case the new pgdata
is not fsync-ed - I’m not sure if that exists in real production environment or not,
but even considering this we could still use the optimization for the case that
calls ensureCleanShutdown() since this ensures a pgdata fsync on the target.



Вложения

Re: Two patches to speed up pg_rewind.

От
Paul Guo
Дата:

 

> On 2021/2/19, 10:33 AM, "Paul Guo" <guopa@vmware.com> wrote:

 

> Refactored the code a bit along with fixes. Manually tested them on centos
> & Ubuntu (the later has copy_file_range())

> For the first patch, actually I have some concerns. My assumption is that
> the target pg_data directory should be fsync-ed already. This should be
> correct normally but there is one scenario: a cleanly-shutdown database’s
> pgdata directory was copied to another directory, in this case the new pgdata
> is not fsync-ed - I’m not sure if that exists in real production environment or not,
> but even considering this we could still use the optimization for the case that
> calls ensureCleanShutdown() since this ensures a pgdata fsync on the target.

Did some small modification and rebased the code. See attached for the new version.

 

Вложения

Re: Two patches to speed up pg_rewind.

От
Michael Paquier
Дата:
On Fri, May 28, 2021 at 05:30:51AM +0000, Paul Guo wrote:
> Did some small modification and rebased the code. See attached for the new version.

Regarding patch 0002, I find the inter-dependencies between
write_target_range() and copy_target_range() a bit confusing.  There
is also a bit of duplication for dry_run, fetch_done and the progress
reporting.  Perhaps it would be cleaner to have a fallback
implementation of copy_file_range() in src/port/ and reduce the
footprint of the patch in pg_rewind?

Note: FreeBSD 13~ has support for copy_file_range(), nice..  Adding
Thomas in CC in case I am missing something.
--
Michael

Вложения

Re: Two patches to speed up pg_rewind.

От
Thomas Munro
Дата:
On Wed, Jun 2, 2021 at 5:20 PM Michael Paquier <michael@paquier.xyz> wrote:
> Note: FreeBSD 13~ has support for copy_file_range(), nice..  Adding
> Thomas in CC in case I am missing something.

Yeah, so at least in theory Linux and FreeBSD can now both do tricks
like pushing copies down to network filesystems, COW file systems, and
(I believe not actually done by anyone yet, could be wrong) SCSI and
NVMe devices (they have commands like XCOPY that can copy block ranges
directly).  I read a few things about all that, and I had a trivial
patch to try to use it in the places in the backend where we copy
files (like cloning a database with CREATE DATABASE and moving files
with ALTER TABLE SET TABLESPACE), but I hadn't got as far as actually
trying it on any interesting filesystems or figuring out any really
good uses for it.  FWIW, here it is:

https://github.com/postgres/postgres/compare/master...macdice:copy_file_range

The main thing I noticed was that Linux < 5.3 can fail with EXDEV if
you cross a filesystem boundary, is that something we need to worry
abou there?



Re: Two patches to speed up pg_rewind.

От
Michael Paquier
Дата:
On Wed, Jun 02, 2021 at 06:20:30PM +1200, Thomas Munro wrote:
> The main thing I noticed was that Linux < 5.3 can fail with EXDEV if
> you cross a filesystem boundary, is that something we need to worry
> about there?

Hmm.  Good point.  That may justify having a switch to control that.
--
Michael

Вложения

Re: Two patches to speed up pg_rewind.

От
Michael Paquier
Дата:
On Wed, Jun 02, 2021 at 05:02:10PM +0900, Michael Paquier wrote:
> On Wed, Jun 02, 2021 at 06:20:30PM +1200, Thomas Munro wrote:
> > The main thing I noticed was that Linux < 5.3 can fail with EXDEV if
> > you cross a filesystem boundary, is that something we need to worry
> > about there?
>
> Hmm.  Good point.  That may justify having a switch to control that.

Paul, the patch set still needs some work, so I am switching it as
waiting on author.  I am pretty sure that we had better have a
fallback implementation of copy_file_range() in src/port/, and that we
are going to need an extra switch in pg_rewind to allow users to
bypass copy_file_range()/EXDEV if they do a local rewind operation
across different FSes with a kernel < 5.3.
--
Michael

Вложения

Re: Two patches to speed up pg_rewind.

От
Paul Guo
Дата:

No worry I’m work on this.

 

On 2021/6/17, 3:18 PM, "Michael Paquier" <michael@paquier.xyz> wrote:

On Wed, Jun 02, 2021 at 05:02:10PM +0900, Michael Paquier wrote:

> On Wed, Jun 02, 2021 at 06:20:30PM +1200, Thomas Munro wrote:

> > The main thing I noticed was that Linux < 5.3 can fail with EXDEV if

> > you cross a filesystem boundary, is that something we need to worry

> > about there?

>

> Hmm.  Good point.  That may justify having a switch to control that.

 

Paul, the patch set still needs some work, so I am switching it as

waiting on author.  I am pretty sure that we had better have a

fallback implementation of copy_file_range() in src/port/, and that we

are going to need an extra switch in pg_rewind to allow users to

bypass copy_file_range()/EXDEV if they do a local rewind operation

across different FSes with a kernel < 5.3.

--

Michael

 

Re: Two patches to speed up pg_rewind.

От
Paul Guo
Дата:
On Thu, Jun 17, 2021 at 3:19 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Jun 02, 2021 at 05:02:10PM +0900, Michael Paquier wrote:
> > On Wed, Jun 02, 2021 at 06:20:30PM +1200, Thomas Munro wrote:
> > > The main thing I noticed was that Linux < 5.3 can fail with EXDEV if
> > > you cross a filesystem boundary, is that something we need to worry
> > > about there?
> >
> > Hmm.  Good point.  That may justify having a switch to control that.
>
> Paul, the patch set still needs some work, so I am switching it as
> waiting on author.  I am pretty sure that we had better have a
> fallback implementation of copy_file_range() in src/port/, and that we
> are going to need an extra switch in pg_rewind to allow users to
> bypass copy_file_range()/EXDEV if they do a local rewind operation
> across different FSes with a kernel < 5.3.
> --

I did modification on the copy_file_range() patch yesterday by simply falling
back to read()+write() but I think it could be improved further.

We may add a function to determine two file/path are copy_file_range()
capable or not (using POSIX standard statvfs():f_fsid?) - that could be used
by other copy_file_range() users although in the long run the function
is not needed.
And even having this we may still need the fallback code if needed.

- For pg_rewind, we may just determine that ability once on src/dst pgdata, but
  since there might be soft link (tablespace/wal) in pgdata so we should still
  allow fallback for those non copy_fie_range() capable file copying.
- Also it seems that sometimes copy_file_range() could return ENOTSUP/EOPNOTSUP
  (the file system does not support that and the kernel does not fall
back to simple copying?)
  although this is not documented and it seems not usual?

Any idea?



Re: Two patches to speed up pg_rewind.

От
Paul Guo
Дата:
On Tue, Jun 22, 2021 at 11:08 AM Paul Guo <paulguo@gmail.com> wrote:
>
> On Thu, Jun 17, 2021 at 3:19 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Wed, Jun 02, 2021 at 05:02:10PM +0900, Michael Paquier wrote:
> > > On Wed, Jun 02, 2021 at 06:20:30PM +1200, Thomas Munro wrote:
> > > > The main thing I noticed was that Linux < 5.3 can fail with EXDEV if
> > > > you cross a filesystem boundary, is that something we need to worry
> > > > about there?
> > >
> > > Hmm.  Good point.  That may justify having a switch to control that.
> >
> > Paul, the patch set still needs some work, so I am switching it as
> > waiting on author.  I am pretty sure that we had better have a
> > fallback implementation of copy_file_range() in src/port/, and that we
> > are going to need an extra switch in pg_rewind to allow users to
> > bypass copy_file_range()/EXDEV if they do a local rewind operation
> > across different FSes with a kernel < 5.3.
> > --
>
> I did modification on the copy_file_range() patch yesterday by simply falling
> back to read()+write() but I think it could be improved further.
>
> We may add a function to determine two file/path are copy_file_range()
> capable or not (using POSIX standard statvfs():f_fsid?) - that could be used
> by other copy_file_range() users although in the long run the function
> is not needed.
> And even having this we may still need the fallback code if needed.
>
> - For pg_rewind, we may just determine that ability once on src/dst pgdata, but
>   since there might be soft link (tablespace/wal) in pgdata so we should still
>   allow fallback for those non copy_fie_range() capable file copying.
> - Also it seems that sometimes copy_file_range() could return ENOTSUP/EOPNOTSUP
>   (the file system does not support that and the kernel does not fall
> back to simple copying?)
>   although this is not documented and it seems not usual?
>
> Any idea?

I modified the copy_file_range() patch using the below logic:

If the first call of copy_file_range() fails with errno EXDEV or
ENOTSUP, pg_rewind
would not use copy_file_range() in rest code, and if copy_file_range() fails we
fallback to use the previous read()+write() code logic for the file.

Вложения

Re: Two patches to speed up pg_rewind.

От
Michael Paquier
Дата:
On Thu, Aug 05, 2021 at 06:18:03PM +0800, Paul Guo wrote:
> I modified the copy_file_range() patch using the below logic:
>
> If the first call of copy_file_range() fails with errno EXDEV or
> ENOTSUP, pg_rewind
> would not use copy_file_range() in rest code, and if copy_file_range() fails we
> fallback to use the previous read()+write() code logic for the file.

I have looked at 0001, and I don't really like it.  One argument
against this approach is that if pg_rewind fails in the middle of its
operation then we would have done a set of fsync() for nothing, with
the data folder still unusable.  I would be curious to see some
numbers to see how much it matters with many physical files (say cases
with thousands of small relations?).

+/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
+#if defined(HAVE_SYNC_FILE_RANGE)
+#define PG_FLUSH_DATA_WORKS 1
+#elif !defined(WIN32) && defined(MS_ASYNC)
+#define PG_FLUSH_DATA_WORKS 1
+#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
+#define PG_FLUSH_DATA_WORKS 1

This is wrong for the code frontend on platforms that may finish by
using MS_ASYNC, no?  There is no such implementation in file_utils.c
but there is one in fd.c.

+   fsync_fname("global/pg_control", false);
+   fsync_fname("backup_label", false);
+   if (access("recovery.conf", F_OK) == 0)
+       fsync_fname("recovery.conf", false);
+   if (access("postgresql.auto.conf", F_OK) == 0)
+       fsync_fname("postgresql.auto.conf", false);

This list is wrong on various aspects, no?  This would miss custom
configuration files, or included files.

-   if (showprogress)
-       pg_log_info("syncing target data directory");
-   sync_target_dir();
-
    /* Also update the standby configuration, if requested. */
    if (writerecoveryconf && !dry_run)
    WriteRecoveryConfig(conn, datadir_target,
                    GenerateRecoveryConfig(conn, NULL));

+   if (showprogress)
+       pg_log_info("syncing target data directory");
+   perform_sync(filemap);

Why inverting the order here?

+           * Pre Linux 5.3 does not allow cross-fs copy_file_range() call
+           * (return EXDEV). Some fs do not support copy_file_range() (return
+           * ENOTSUP). Here we explicitly disable copy_file_range() for the
+           * two scenarios. For other failures we still allow subsequent
+           * copy_file_range() try.
+           */
+           if (errno == ENOTSUP || errno == EXDEV)
+               copy_file_range_support = false;
Are you sure that it is right to always cancel the support of
copy_file_range() after it does not work once?  Couldn't it be
possible that things work properly depending on the tablespace being
worked on by pg_rewind?

Having the facility for copy_file_range() in pg_rewind is not nice at
the end, and we are going to need a run-time check to fallback
dynamically to an equivalent implementation on errno={EXDEV,ENOTSUP}.
Hmm.  What about locating all that in file_utils.c instead, with a
brand new routine name (pg_copy_file_range would be one choice)?  We
still need the ./configure check, except that the conditions to use
the fallback implementation is in this routine, aka fallback on EXDEV,
ENOTSUP or !HAVE_COPY_FILE_RANGE.  The backend may need to solve this
problem at some point, but logging and fd handling will likely just
locate that in fd.c, so having one routine for the purpose of all
frontends looks like a step in the right direction.
--
Michael

Вложения

Re: Two patches to speed up pg_rewind.

От
Michael Paquier
Дата:
On Tue, Aug 17, 2021 at 04:47:44PM +0900, Michael Paquier wrote:
> One argument
> against this approach is that if pg_rewind fails in the middle of its
> operation then we would have done a set of fsync() for nothing, with
> the data folder still unusable.

I was skimming through the patch this morning, and that argument does
not hold much water as the flushes happen in the same place.  Seems
like I got confused, sorry about that.

> I would be curious to see some
> numbers to see how much it matters with many physical files (say cases
> with thousands of small relations?).

For this one, one simple idea would be to create a lot of fake
relation files with a pre-determined size and check how things
change.
--
Michael

Вложения

Re: Two patches to speed up pg_rewind.

От
Paul Guo
Дата:
Thanks for reviewing, please see the replies below.

On Tue, Aug 17, 2021 at 3:47 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Aug 05, 2021 at 06:18:03PM +0800, Paul Guo wrote:
> > I modified the copy_file_range() patch using the below logic:
> >
> > If the first call of copy_file_range() fails with errno EXDEV or
> > ENOTSUP, pg_rewind
> > would not use copy_file_range() in rest code, and if copy_file_range() fails we
> > fallback to use the previous read()+write() code logic for the file.
>
> I have looked at 0001, and I don't really like it.  One argument
> against this approach is that if pg_rewind fails in the middle of its
> operation then we would have done a set of fsync() for nothing, with
> the data folder still unusable.  I would be curious to see some
> numbers to see how much it matters with many physical files (say cases
> with thousands of small relations?).
> +/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
> +#if defined(HAVE_SYNC_FILE_RANGE)
> +#define PG_FLUSH_DATA_WORKS 1
> +#elif !defined(WIN32) && defined(MS_ASYNC)
> +#define PG_FLUSH_DATA_WORKS 1
> +#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
> +#define PG_FLUSH_DATA_WORKS 1
>
> This is wrong for the code frontend on platforms that may finish by
> using MS_ASYNC, no?  There is no such implementation in file_utils.c
> but there is one in fd.c.

Yes, it seems that we need to add the MS_ASYNC code (refer that in fd.c) in
src/common/file_utils.c:pre_sync_fname().

> +   fsync_fname("global/pg_control", false);
> +   fsync_fname("backup_label", false);
> +   if (access("recovery.conf", F_OK) == 0)
> +       fsync_fname("recovery.conf", false);
> +   if (access("postgresql.auto.conf", F_OK) == 0)
> +       fsync_fname("postgresql.auto.conf", false);
>
> This list is wrong on various aspects, no?  This would miss custom
> configuration files, or included files.

I did not understand this. Can you please clarify? Anyway let me
explain, here we fsync
these files additionally because pg_rewind (possibly) modified these
files after rewinding.
These files may not be handled/logged in filemap

pg_control action is FILE_ACTION_NONE
backup_label is excluded
recovery.conf is not logged in filemap
postgresql.auto.conf may be logged but let's fsync this file for safety.

>
> -   if (showprogress)
> -       pg_log_info("syncing target data directory");
> -   sync_target_dir();
> -
>     /* Also update the standby configuration, if requested. */
>     if (writerecoveryconf && !dry_run)
>         WriteRecoveryConfig(conn, datadir_target,
>                             GenerateRecoveryConfig(conn, NULL));
>
> +   if (showprogress)
> +       pg_log_info("syncing target data directory");
> +   perform_sync(filemap);
>
> Why inverting the order here?

We need to synchronize the recoveryconf change finally in perform_sync().

>
> +           * Pre Linux 5.3 does not allow cross-fs copy_file_range() call
> +           * (return EXDEV). Some fs do not support copy_file_range() (return
> +           * ENOTSUP). Here we explicitly disable copy_file_range() for the
> +           * two scenarios. For other failures we still allow subsequent
> +           * copy_file_range() try.
> +           */
> +           if (errno == ENOTSUP || errno == EXDEV)
> +               copy_file_range_support = false;
> Are you sure that it is right to always cancel the support of
> copy_file_range() after it does not work once?  Couldn't it be
> possible that things work properly depending on the tablespace being
> worked on by pg_rewind?

Ideally we should retry when first running into a symlink (e.g.
tablespace, wal),
but it seems not easy to do gracefully.

> Having the facility for copy_file_range() in pg_rewind is not nice at
> the end, and we are going to need a run-time check to fallback
> dynamically to an equivalent implementation on errno={EXDEV,ENOTSUP}.
> Hmm.  What about locating all that in file_utils.c instead, with a
> brand new routine name (pg_copy_file_range would be one choice)?  We
> still need the ./configure check, except that the conditions to use
> the fallback implementation is in this routine, aka fallback on EXDEV,
> ENOTSUP or !HAVE_COPY_FILE_RANGE.  The backend may need to solve this
> problem at some point, but logging and fd handling will likely just
> locate that in fd.c, so having one routine for the purpose of all
> frontends looks like a step in the right direction.

Yes, seems better to make it generic.



Re: Two patches to speed up pg_rewind.

От
Michael Paquier
Дата:
On Fri, Aug 20, 2021 at 11:33:33AM +0800, Paul Guo wrote:
> On Tue, Aug 17, 2021 at 3:47 PM Michael Paquier <michael@paquier.xyz> wrote:
> > +   fsync_fname("global/pg_control", false);
> > +   fsync_fname("backup_label", false);
> > +   if (access("recovery.conf", F_OK) == 0)
> > +       fsync_fname("recovery.conf", false);
> > +   if (access("postgresql.auto.conf", F_OK) == 0)
> > +       fsync_fname("postgresql.auto.conf", false);
> >
> > This list is wrong on various aspects, no?  This would miss custom
> > configuration files, or included files.
>
> I did not understand this. Can you please clarify? Anyway let me
> explain, here we fsync
> these files additionally because pg_rewind (possibly) modified these
> files after rewinding.
> These files may not be handled/logged in filemap
>
> pg_control action is FILE_ACTION_NONE
> backup_label is excluded
> recovery.conf is not logged in filemap
> postgresql.auto.conf may be logged but let's fsync this file for safety.

I am referring to new files copied from the origin cluster to the
target, as pg_rewind copies everything.  postgresql.conf could for
example include a foo.conf, which would be ignored here.

Anyway, it seems to me that the copy_file_range() bits of the patch
with the copy optimizations are more interesting that the flush
optimizations, so I would tend to focus on that first.

>> +           * Pre Linux 5.3 does not allow cross-fs copy_file_range() call
>> +           * (return EXDEV). Some fs do not support copy_file_range() (return
>> +           * ENOTSUP). Here we explicitly disable copy_file_range() for the
>> +           * two scenarios. For other failures we still allow subsequent
>> +           * copy_file_range() try.
>> +           */
>> +           if (errno == ENOTSUP || errno == EXDEV)
>> +               copy_file_range_support = false;
>> Are you sure that it is right to always cancel the support of
>> copy_file_range() after it does not work once?  Couldn't it be
>> possible that things work properly depending on the tablespace being
>> worked on by pg_rewind?
>
> Ideally we should retry when first running into a symlink (e.g.
> tablespace, wal),
> but it seems not easy to do gracefully.

My guess here is that we should just remove this flag, and attempt
copy_range_file() for each file.  That brings an extra point that
requires benchmarking, actually.

>> Having the facility for copy_file_range() in pg_rewind is not nice at
>> the end, and we are going to need a run-time check to fallback
>> dynamically to an equivalent implementation on errno={EXDEV,ENOTSUP}.
>> Hmm.  What about locating all that in file_utils.c instead, with a
>> brand new routine name (pg_copy_file_range would be one choice)?  We
>> still need the ./configure check, except that the conditions to use
>> the fallback implementation is in this routine, aka fallback on EXDEV,
>> ENOTSUP or !HAVE_COPY_FILE_RANGE.  The backend may need to solve this
>> problem at some point, but logging and fd handling will likely just
>> locate that in fd.c, so having one routine for the purpose of all
>> frontends looks like a step in the right direction.
>
> Yes, seems better to make it generic.

One disadvantage of having a fallback implementation in file_utils.c,
now that I look closely, is that we would make the progress reporting
less verbose as we now call write_target_range() every 8kB for each
block.  So on this point, your approach keeps the code simpler, while
my suggestion makes this logic more complicated.

Anyway, I think that it would be good to do more benchmarking for this
patch first, and the patch you are proposing is enough for that.
There are two scenarios I can think as useful to look at, to emulate
cases where pg_rewind has to copy a set of files:
- A small number of large files (say 3~5 files of 600MB~1GB).
- Many small files (say 8MB~16MB with 200~ files).
Those numbers can be tweaked up or down, as long as a difference can
be measured while avoiding noise in runtimes.

I only have at hand now a system with ext4 on a 5.10 kernel, that does
not have any acceleration techniques as far as I know, so I have just
measured that this introduces no regressions.  But it would be good to
also see how much performance we'd gain with something that can take
advantage of copy_file_range() in full.  One good case would be XFS
with reflink=1 and see how fast we go with the two cases from above
with and without the patch.  Anything I have read on the matter means
that the copy will be faster, but it would be good to have numbers to
confirm.

copy_file_range_support in the patch is also something I am worrying
about, as it could lead to incorrect decisions depending on the order
of the paths processed depending on the mount points of the origin
and/or the target.  Removing it means that it would be important to
measure the case where we use copy_file_range(), but fail all (or some
of!) its calls on EXDEV.  That would imply a test to compare the
performance with and without the patch where the origin and target
folders are on different mount points.  I looked at some kernel code
from 5.2 and anything looks cheap enough with a lookup at the inodes
of the source and targets to check the mount points involved.

So, what do you think?

By the way, the progress reporting is wrong in this new code path:

+bool
+copy_target_range(int srcfd, off_t begin, size_t size)
+{
+   ssize_t     copylen;
+
+   /* update progress report */
+   fetch_done += size;
+   progress_report(false);

If copy_file_range() returns false, say because of EXDEV, we would
finish by counting the same size twice, with write_target_range()
reporting this amount of size a second time for each block of 8kB.
--
Michael

Вложения