Обсуждение: Executing pg_createsubscriber with a non-compatible control file

Поиск
Список
Период
Сортировка

Executing pg_createsubscriber with a non-compatible control file

От
Michael Paquier
Дата:
Hi all,
(Adding Euler in CC.)

I have mentioned that on this thread:
https://www.postgresql.org/message-id/aOM-T6iojyzb7OVB@paquier.xyz

Attempting to execute pg_createsubscriber compiled with a version of
PG_CONTROL_VERSION incompatible with the standby we want to switch to
a logical replica leads to the following error:
$ pg_createsubscriber -D $HOME/data/5433 \
    -P "host=/tmp port=5432" -d postgres
pg_createsubscriber: error: control file appears to be corrupt

For example, setup a cluster based on v17 and run the command.  This
is confusing, because the control file is not corrupted,
pg_createsubscriber is just missing the fact that the major version it
is compiled with is not able to manipulate the control file of the
target data folder.  We enforce such checks for other tools, like
pg_checksums or pg_rewind, and I don't see why pg_createsubscriber
should be an exception?

One simple thing that could be done is to grab the first line of the
standby's PG_VERSION, at least, so as we have a check that does not
depend on the contents of the control file, and it would unlikely be
corrupted.  Having a check based on PG_CONTROL_VERSION would be a few
lines less, of course, that we could try in parallel of the CRC check
report if the value is sane-ish, from a past version.

Thoughts?
--
Michael

Вложения

Re: Executing pg_createsubscriber with a non-compatible control file

От
Masahiko Sawada
Дата:
On Sun, Oct 5, 2025 at 9:19 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> Hi all,
> (Adding Euler in CC.)
>
> I have mentioned that on this thread:
> https://www.postgresql.org/message-id/aOM-T6iojyzb7OVB@paquier.xyz
>
> Attempting to execute pg_createsubscriber compiled with a version of
> PG_CONTROL_VERSION incompatible with the standby we want to switch to
> a logical replica leads to the following error:
> $ pg_createsubscriber -D $HOME/data/5433 \
>     -P "host=/tmp port=5432" -d postgres
> pg_createsubscriber: error: control file appears to be corrupt
>
> For example, setup a cluster based on v17 and run the command.  This
> is confusing, because the control file is not corrupted,
> pg_createsubscriber is just missing the fact that the major version it
> is compiled with is not able to manipulate the control file of the
> target data folder.  We enforce such checks for other tools, like
> pg_checksums or pg_rewind, and I don't see why pg_createsubscriber
> should be an exception?

+1 to add a major version check for better log messages.

Just to be clear, did you mean that pg_checksums and pg_rewind already
do such checks? IIUC pg_checksums does CRC check for the control file,
and if we execute v18-pg_checksums against v17-cluster we end up with
a similar error but with a different log message like "pg_checksums:
error: pg_control CRC value is incorrect". Those log messages are not
helpful either.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Executing pg_createsubscriber with a non-compatible control file

От
Michael Paquier
Дата:
On Tue, Oct 07, 2025 at 11:51:45AM -0700, Masahiko Sawada wrote:
> Just to be clear, did you mean that pg_checksums and pg_rewind already
> do such checks? IIUC pg_checksums does CRC check for the control file,
> and if we execute v18-pg_checksums against v17-cluster we end up with
> a similar error but with a different log message like "pg_checksums:
> error: pg_control CRC value is incorrect". Those log messages are not
> helpful either.

For the case of pg_checksums, we don't really have a constraint based
on the major version, currently.  However, there is a PG_VERSION_NUM
hardcoded when syncing the data folder, so we are pretty much assuming
that it is the case already.

A check based on PG_VERSION has more benefits in the long-term, IMO.
When the CRC check of the control file fails, it would be tempting to
use some of the contents read from the control file but that would
also mean that we could expose some corrupted values to the user, so
that would not be useful.

How about extracting from pg_rewind what it does with the on-disk
PG_VERSION and create an API that returns a major version number  with
a data folder given in input?  We could then reuse this API for the
other tools, at least pg_createsubscriber and pg_checksums.  I am not
sure if there is a point in backpatching any of that, but it would
lead to more user-friendly errors for all of these.
--
Michael

Вложения

Re: Executing pg_createsubscriber with a non-compatible control file

От
Masahiko Sawada
Дата:
On Tue, Oct 7, 2025 at 5:23 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Oct 07, 2025 at 11:51:45AM -0700, Masahiko Sawada wrote:
> > Just to be clear, did you mean that pg_checksums and pg_rewind already
> > do such checks? IIUC pg_checksums does CRC check for the control file,
> > and if we execute v18-pg_checksums against v17-cluster we end up with
> > a similar error but with a different log message like "pg_checksums:
> > error: pg_control CRC value is incorrect". Those log messages are not
> > helpful either.
>
> For the case of pg_checksums, we don't really have a constraint based
> on the major version, currently.  However, there is a PG_VERSION_NUM
> hardcoded when syncing the data folder, so we are pretty much assuming
> that it is the case already.
>
> A check based on PG_VERSION has more benefits in the long-term, IMO.
> When the CRC check of the control file fails, it would be tempting to
> use some of the contents read from the control file but that would
> also mean that we could expose some corrupted values to the user, so
> that would not be useful.
>
> How about extracting from pg_rewind what it does with the on-disk
> PG_VERSION and create an API that returns a major version number  with
> a data folder given in input?  We could then reuse this API for the
> other tools, at least pg_createsubscriber and pg_checksums.  I am not
> sure if there is a point in backpatching any of that, but it would
> lead to more user-friendly errors for all of these.

+1, sounds like a good idea to improve user experience. I think we can
use the API in pg_combinebackup.c too since it has a similar function,
read_pg_version_file().

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Executing pg_createsubscriber with a non-compatible control file

От
Michael Paquier
Дата:
On Thu, Oct 09, 2025 at 11:22:47AM -0700, Masahiko Sawada wrote:
> +1, sounds like a good idea to improve user experience. I think we can
> use the API in pg_combinebackup.c too since it has a similar function,
> read_pg_version_file().

Please find attached what I am finishing with.  At the end, I have
designed an API that can be reused across the board for the following
tools that do the same things in the tree, removing some duplicated
code:
- pg_resetwal
- pg_upgrade
- pg_combinebackup

The routine that retrieves the contents gets a uint32 number, and it
is optionally possible to get the contents of PG_VERSION (pg_upgrade
wants that for tablespace paths, but that's really for pg_resetwal to
be able to show back buggy data):
extern uint32 get_pg_version(const char *datadir, char **version_str);

This support both the pre-v10 and the post-v10 version formats, for
pg_upgrade.

To ease comparisons with PG_MAJORVERSION_NUM, I have added a small
helper macro (see GET_PG_MAJORVERSION_NUM).

I have also applied the same method to pg_createsubscriber, on top of
that, to take care of my original issue.  I have not looked at other
places where the same concept could be applied, at least it's a start.

Thoughts or comments?
--
Michael

Вложения

Re: Executing pg_createsubscriber with a non-compatible control file

От
Masahiko Sawada
Дата:
On Thu, Oct 9, 2025 at 11:08 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Oct 09, 2025 at 11:22:47AM -0700, Masahiko Sawada wrote:
> > +1, sounds like a good idea to improve user experience. I think we can
> > use the API in pg_combinebackup.c too since it has a similar function,
> > read_pg_version_file().
>
> Please find attached what I am finishing with.  At the end, I have
> designed an API that can be reused across the board for the following
> tools that do the same things in the tree, removing some duplicated
> code:
> - pg_resetwal
> - pg_upgrade
> - pg_combinebackup
>
> The routine that retrieves the contents gets a uint32 number, and it
> is optionally possible to get the contents of PG_VERSION (pg_upgrade
> wants that for tablespace paths, but that's really for pg_resetwal to
> be able to show back buggy data):
> extern uint32 get_pg_version(const char *datadir, char **version_str);
>
> This support both the pre-v10 and the post-v10 version formats, for
> pg_upgrade.
>
> To ease comparisons with PG_MAJORVERSION_NUM, I have added a small
> helper macro (see GET_PG_MAJORVERSION_NUM).
>
> I have also applied the same method to pg_createsubscriber, on top of
> that, to take care of my original issue.  I have not looked at other
> places where the same concept could be applied, at least it's a start.
>
> Thoughts or comments?

Thank you for creating the patches!

v1-0001-Introduce-API-able-to-retrieve-contents-of-PG_VER.patch:

LGTM

v1-0002-pg_upgrade-Use-PG_VERSION-generic-routine.patch:

LGTM

v1-0003-pg_createsubscriber-Use-PG_VERSION-generic-routin.patch:

+   major_version = GET_PG_MAJORVERSION_NUM(get_pg_version(datadir, NULL));
+   if (major_version != PG_MAJORVERSION_NUM)
    {
-       pg_fatal("directory \"%s\" is not a database cluster directory",
-                datadir);
+       pg_log_error("data directory is of wrong version");
+       pg_log_error_detail("File \"%s\" contains \"%u\", which is not
compatible with this program's version \"%u\".",
+                           "PG_VERSION", major_version, PG_MAJORVERSION_NUM);
+       exit(1);
    }

The new log detail message uses the same message as what pg_resetwal
uses, but pg_createsubscriber shows an integer major version whereas
pg_resetwal shows the raw version string. I guess it's better to unify
the usage for better consistency.

v1-0004-pg_combinebackup-use-PG_VERSION-generic-routine.patch:

+   pg_log_debug("read server version %u from file \"%s\"",
+                GET_PG_MAJORVERSION_NUM(version), "PG_VERSION");

We used to show the full path of PG_VERSION file in the debug message.
While probably we can live with such a small difference, do you think
it's a good idea to move the debug message to get_pg_version()?

v1-0005-pg_resetwal-use-PG_VERSION-generic-routine.patch:

    /* Check that data directory matches our server version */
-   CheckDataVersion();
+   CheckDataVersion(DataDir);

With the patch, pg_resetwal fails to handle a relative path of PGDATA
as follows:

$ bin/pg_resetwal data
pg_resetwal: error: could not open version file "data/PG_VERSION": No
such file or directory

This is because pg_resetwal does chdir() to the given path of the data
directory before checking the version.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Executing pg_createsubscriber with a non-compatible control file

От
Michael Paquier
Дата:
On Mon, Oct 13, 2025 at 03:35:06PM -0700, Masahiko Sawada wrote:
> v1-0001-Introduce-API-able-to-retrieve-contents-of-PG_VER.patch:
>
> v1-0002-pg_upgrade-Use-PG_VERSION-generic-routine.patch:
> v1-0003-pg_createsubscriber-Use-PG_VERSION-generic-routin.patch:

Applied both of these.

> The new log detail message uses the same message as what pg_resetwal
> uses, but pg_createsubscriber shows an integer major version whereas
> pg_resetwal shows the raw version string. I guess it's better to unify
> the usage for better consistency.

OK, done as suggested to limit the translation work.

> v1-0004-pg_combinebackup-use-PG_VERSION-generic-routine.patch:
>
> +   pg_log_debug("read server version %u from file \"%s\"",
> +                GET_PG_MAJORVERSION_NUM(version), "PG_VERSION");
>
> We used to show the full path of PG_VERSION file in the debug message.
> While probably we can live with such a small difference, do you think
> it's a good idea to move the debug message to get_pg_version()?

I cannot get into doing that, leaving that up to the caller with the
error message they want.  That's a minor point, I guess, I can see
both sides of the coin.

Switched this one to report the full path, like previously.

> v1-0005-pg_resetwal-use-PG_VERSION-generic-routine.patch:
>
>     /* Check that data directory matches our server version */
> -   CheckDataVersion();
> +   CheckDataVersion(DataDir);
>
> With the patch, pg_resetwal fails to handle a relative path of PGDATA
> as follows:
>
> $ bin/pg_resetwal data
> pg_resetwal: error: could not open version file "data/PG_VERSION": No
> such file or directory
>
> This is because pg_resetwal does chdir() to the given path of the data
> directory before checking the version.

Right.  I've tested with absolute paths and forgot relative paths.
For this one, I would just use a ".", as the chdir is before the
version check.  Or we could reverse the chdir() and the version check,
but there is no real benefit in doing so.

Updated patch set attached.
--
Michael

Вложения

Re: Executing pg_createsubscriber with a non-compatible control file

От
Masahiko Sawada
Дата:
On Tue, Oct 14, 2025 at 1:04 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Oct 13, 2025 at 03:35:06PM -0700, Masahiko Sawada wrote:
> > v1-0001-Introduce-API-able-to-retrieve-contents-of-PG_VER.patch:
> >
> > v1-0002-pg_upgrade-Use-PG_VERSION-generic-routine.patch:
> > v1-0003-pg_createsubscriber-Use-PG_VERSION-generic-routin.patch:
>
> Applied both of these.
>
> > The new log detail message uses the same message as what pg_resetwal
> > uses, but pg_createsubscriber shows an integer major version whereas
> > pg_resetwal shows the raw version string. I guess it's better to unify
> > the usage for better consistency.
>
> OK, done as suggested to limit the translation work.
>
> > v1-0004-pg_combinebackup-use-PG_VERSION-generic-routine.patch:
> >
> > +   pg_log_debug("read server version %u from file \"%s\"",
> > +                GET_PG_MAJORVERSION_NUM(version), "PG_VERSION");
> >
> > We used to show the full path of PG_VERSION file in the debug message.
> > While probably we can live with such a small difference, do you think
> > it's a good idea to move the debug message to get_pg_version()?
>
> I cannot get into doing that, leaving that up to the caller with the
> error message they want.  That's a minor point, I guess, I can see
> both sides of the coin.
>
> Switched this one to report the full path, like previously.
>
> > v1-0005-pg_resetwal-use-PG_VERSION-generic-routine.patch:
> >
> >     /* Check that data directory matches our server version */
> > -   CheckDataVersion();
> > +   CheckDataVersion(DataDir);
> >
> > With the patch, pg_resetwal fails to handle a relative path of PGDATA
> > as follows:
> >
> > $ bin/pg_resetwal data
> > pg_resetwal: error: could not open version file "data/PG_VERSION": No
> > such file or directory
> >
> > This is because pg_resetwal does chdir() to the given path of the data
> > directory before checking the version.
>
> Right.  I've tested with absolute paths and forgot relative paths.
> For this one, I would just use a ".", as the chdir is before the
> version check.  Or we could reverse the chdir() and the version check,
> but there is no real benefit in doing so.
>
> Updated patch set attached.

Thank you for updating the patch!

All patches look good to me.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Executing pg_createsubscriber with a non-compatible control file

От
Michael Paquier
Дата:
On Tue, Oct 14, 2025 at 11:54:26AM -0700, Masahiko Sawada wrote:
> All patches look good to me.

Thanks.  I have applied all that after a second look, not planning to
bother about the back branches for pg_createsubscriber.

We may be able to apply the version rules to more tools.  pg_checksums
and pg_rewind have hardcoded requirement based on PG_VERSION_NUM.  Not
sure if it's worth bothering about them, though.
--
Michael

Вложения