Discussion: pg_verifybackup: TAR format backup verification

pg_verifybackup: TAR format backup verification

From: Amul Sul

Hi,

Currently, pg_verifybackup only works with plain (directory) format backups.
This proposal aims to support tar-format backups too.  We will read the tar
files from start to finish and verify each file inside against the
backup_manifest information, similar to how it verifies plain files.

We are introducing new options to pg_verifybackup: -F, --format=p|t and -Z,
--compress=METHOD, which allow users to specify the backup format and
compression type, similar to the options available in pg_basebackup. If these
options are not provided, the backup format and compression type are detected
automatically.  To determine the format, we search for a PG_VERSION file in
the backup directory: if it is found, the backup is in plain format;
otherwise, it is a tar-format backup.  For the compression type, we check the
extension of the base.tar.xxx file of the tar-format backup. Refer to patch
0008 for the details.
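
To illustrate the format detection, here is a simplified sketch (the helper
name and exact error handling are illustrative only; see patch 0008 for the
real implementation):

static char
find_backup_format(const char *backup_directory)
{
    char        path[MAXPGPATH];
    struct stat st;

    /* A plain-format backup has PG_VERSION at the top level. */
    snprintf(path, sizeof(path), "%s/PG_VERSION", backup_directory);
    if (stat(path, &st) == 0)
        return 'p';

    /*
     * Otherwise assume tar format; the compression type is then inferred
     * from the extension of the base.tar.xxx file.
     */
    return 't';
}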

The main challenge is to structure the code neatly.  For plain-format backups,
we verify bytes directly from the files.  For tar-format backups, we read bytes
from the tar file of the specific file we care about.  We need an abstraction
to handle both formats smoothly, without using many if statements or special
cases.

To achieve this goal, we need to reuse the existing infrastructure without
duplicating code, and for that, the major work involved here is code
refactoring. Here is a breakdown of the work:

1. BBSTREAMER Rename and Relocate:
BBSTREAMER, currently used in pg_basebackup for reading and decompressing TAR
files, can also be used by pg_verifybackup. In the future, it could support
other tools such as pg_combinebackup, for merging TAR backups without
extraction, and pg_waldump, for verifying WAL files from a tar backup. To make
it accessible to those tools, BBSTREAMER needs to be relocated to a shared
directory.

Moreover, renaming BBSTREAMER to ASTREAMER (short for Archive Streamer) would
better indicate its general application across multiple tools. Moving it to
the src/fe_utils directory is appropriate, given its use as frontend
infrastructure.

2. pg_verifybackup Code Refactoring:
The existing code for plain backup verification will be split into separate
files or functions, so it can also be reused for tar backup verification.

3. Adding TAR Backup Verification:
Finally, patches will be added to implement TAR backup verification, along with
tests and documentation.

Patches 0001-0003 focus on renaming and relocating BBSTREAMER, patches
0004-0007 on splitting the existing verification code, and patches 0008-0010 on
adding TAR backup verification capabilities, tests, and documentation. The last
set could be a single patch but is split to make the review easier.

Please take a look at the attached patches and share your comments,
suggestions, or any ways to enhance them. Your feedback is greatly
appreciated.

Thank you!

--
Regards,
Amul Sul
EDB: http://www.enterprisedb.com

Attachments

Re: pg_verifybackup: TAR format backup verification

From: Sravan Kumar

Hi Amul,
     thanks for working on this.

+ file_name_len = strlen(relpath);
+ if (file_name_len < file_extn_len ||
+     strcmp(relpath + file_name_len - file_extn_len, file_extn) != 0)
+ {
+     if (compress_algorithm == PG_COMPRESSION_NONE)
+         report_backup_error(context,
+                             "\"%s\" is not a valid file, expecting tar file",
+                             relpath);
+     else
+         report_backup_error(context,
+                             "\"%s\" is not a valid file, expecting \"%s\" compressed tar file",
+                             relpath,
+                             get_compress_algorithm_name(compress_algorithm));
+     return;
+ }

I believe pg_verifybackup needs to exit after reporting a failure here since it could not figure out a streamer to allocate.

Also, v1-0002 removes #include "pqexpbuffer.h" from astreamer.h and adds it to the new .h file and in v1-0004 it
reverts the change. So this can be avoided altogether. 

On Tue, Jul 9, 2024 at 3:24 PM Amul Sul <sulamul@gmail.com> wrote:
[...]


--
Thanks & Regards,
Sravan Velagandula
The Enterprise PostgreSQL Company

Re: pg_verifybackup: TAR format backup verification

From: Amul Sul

On Mon, Jul 22, 2024 at 8:29 AM Sravan Kumar <sravanvcybage@gmail.com> wrote:
>
> Hi Amul,
>      thanks for working on this.
>

Thanks for your review.

>> + file_name_len = strlen(relpath);
>> + if (file_name_len < file_extn_len ||
>> +     strcmp(relpath + file_name_len - file_extn_len, file_extn) != 0)
>> + {
>> +     if (compress_algorithm == PG_COMPRESSION_NONE)
>> +         report_backup_error(context,
>> +                             "\"%s\" is not a valid file, expecting tar file",
>> +                             relpath);
>> +     else
>> +         report_backup_error(context,
>> +                             "\"%s\" is not a valid file, expecting \"%s\" compressed tar file",
>> +                             relpath,
>> +                             get_compress_algorithm_name(compress_algorithm));
>> +     return;
>> + }
>
>
> I believe pg_verifybackup needs to exit after reporting a failure here
> since it could not figure out a streamer to allocate.
>
The intention here is to continue the verification of the remaining tar files
instead of exiting immediately in case of an error. If the user prefers an
immediate exit, they can use the --exit-on-error option of pg_verifybackup.
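
For reference, report_backup_error() already implements that behavior;
roughly (paraphrased from memory, not copied verbatim from the code):

void
report_backup_error(verifier_context *context, const char *pg_restrict fmt,...)
{
    va_list     ap;

    va_start(ap, fmt);
    pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
    va_end(ap);

    /* Exit only if the user asked for --exit-on-error. */
    if (context->exit_on_error)
        exit(1);
    context->saw_any_error = true;
}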


> Also, v1-0002 removes #include "pqexpbuffer.h" from astreamer.h and adds it to the new .h file and in v1-0004 it
> reverts the change. So this can be avoided altogether.
>
Fixed in the attached version.

Regards,
Amul

Attachments

Re: pg_verifybackup: TAR format backup verification

From: Robert Haas

On Mon, Jul 22, 2024 at 7:53 AM Amul Sul <sulamul@gmail.com> wrote:
> Fixed in the attached version.

First of all, in the interest of full disclosure, I suggested this
project to Amul, so I'm +1 on the concept. I think making more of our
backup-related tools work with tar and compressed tar formats -- and
perhaps eventually data not stored locally -- will make them a lot
more usable. If, for example, you take a full backup and an
incremental backup, each in tar format, store them in the cloud
someplace, and then want to verify and afterwards restore the
incremental backup, you would need to download the tar files from the
cloud, then extract all the tar files, then run pg_verifybackup and
pg_combinebackup over the results. With this patch set, and similar
work for pg_combinebackup, you could skip the step where you need to
extract the tar files, saving significant amounts of time and disk
space. If the tools also had the ability to access data remotely, you
could save even more, but that's a much harder project, so it makes
sense to me to start with this.

Second, I think this patch set is quite well-organized and easy to
read. That's not to say there is nothing in these patches to which
someone might object, but it seems to me that it should at least be
simple for anyone who wants to review to find the things to which they
object in the patch set without having to spend too much time on it,
which is fantastic.

Third, I think the general approach that these patches take to the
problem - namely, renaming bbstreamer to astreamer and moving it
somewhere that permits it to be reused - makes a lot of sense. To be
honest, I don't think I had it in mind that bbstreamer would be a
reusable component when I wrote it, or if I did have it in mind, it
was off in some dusty corner of my mind that doesn't get visited very
often. I was imagining that you would need to build new infrastructure
to deal with reading the tar file, but I think what you've done here
is better. Reusing the bbstreamer stuff gives you tar file parsing,
and decompression if necessary, basically for free, and IMHO the
result looks rather elegant.

However, I'm not very convinced by 0003. The handling between the
meson and make-based build systems doesn't seem consistent. On the
meson side, you just add the objects to the same list that contains
all of the other files (but not in alphabetical order, which should be
fixed). But on the make side, you for some reason invent a separate
AOBJS list instead of just adding the files to OBJS. I don't think it
should be necessary to treat these objects any differently from any
other objects, so they should be able to just go in OBJS: but if it
were necessary, then I feel like the meson side would need something
similar.

Also, I'm not so sure about this change to src/fe_utils/meson.build:

-  dependencies: frontend_common_code,
+  dependencies: [frontend_common_code, lz4, zlib, zstd],

frontend_common_code already includes dependencies on zlib and zstd,
so we probably don't need to add those again here. I checked the
result of otool -L src/bin/pg_controldata/pg_controldata from the
meson build directory, and I find that currently it links against libz
and libzstd but not liblz4. However, if I either make this line say
dependencies: [frontend_common_code, lz4] or if I just update
frontend_common_code to include lz4, then it starts linking against
liblz4 as well. I'm not entirely sure if there's any reason to do one
or the other of those things, but I think I'd be inclined to make
frontend_common_code just include lz4 since it already includes zlib
and zstd anyway, and then you don't need this change.

Alternatively, we could take the position that we need to avoid having
random front-end tools that don't do anything with compression at all,
like pg_controldata for example, to link with compression libraries at
all. But then we'd need to rethink a bunch of things that have not
much to do with this patch.

Regarding 0004, I would rather not move show_progress and
skip_checksums to the new header file. I suppose I was a bit lazy in
making these file-level global variables instead of passing them down
using function arguments and/or a context object, but at least right
now they're only global within a single file. Can we think of
inserting a preparatory patch that moves these into verifier_context?
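
To sketch what I mean (the existing members here are reproduced from memory,
so treat this as illustrative rather than exact):

typedef struct verifier_context
{
    manifest_data *manifest;
    char       *backup_directory;
    SimpleStringList ignore_list;
    bool        exit_on_error;
    bool        saw_any_error;
    bool        show_progress;      /* moved from the file-level global */
    bool        skip_checksums;     /* moved from the file-level global */
} verifier_context;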

Regarding 0005, the comment /* Check whether there's an entry in the
manifest hash. */ should move inside verify_manifest_entry, where
manifest_files_lookup is called. The call to the new function
verify_manifest_entry() needs its own, proper comment. Also, I think
there's a null-pointer deference hazard here, because
verify_manifest_entry() can return NULL but the "Validate the manifest
system identifier" chunk assumes it isn't. I think you could hit this
- and presumably seg fault - if pg_control is on disk but not in the
manifest.  Seems like just adding an m != NULL test is easiest, but
see also below comments about 0006.

Regarding 0006, suppose that the member file within the tar archive is
longer than expected. With the unpatched code, we'll feed all of the
data to the checksum context, but then, after the read-loop
terminates, we'll complain about the file being the wrong length. With
the patched code, we'll complain about the checksum mismatch before
returning from verify_content_checksum(). I think that's an unintended
behavior change, and I think the new behavior is worse than the old
behavior. But also, I think that in the case of a tar file, the
desired behavior is quite different. In that case, we know the length
of the file from the member header, so we can check whether the length
is as expected before we read any of the data bytes. If we discover
that the size is wrong, we can complain about that and need not feed
the checksum bytes to the checksum context at all -- we can just skip
them, which will be faster. That approach doesn't really make sense
for a file, because even if we were to stat() the file before we
started reading it, the length could theoretically change as we are
reading it, if someone is concurrently modifying it, but for a tar
file I think it does.
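
In other words, in the tar-reading code I'd expect something along these
lines when the member header is seen (entirely illustrative; the variable
names are made up):

/* In the ASTREAMER_MEMBER_HEADER case, before any contents are read. */
if (m != NULL && m->matched && !m->bad && member->size != m->size)
{
    report_backup_error(mystreamer->context,
                        "\"%s\" has size %lld in the archive but size %lld in the manifest",
                        m->pathname,
                        (long long) member->size,
                        (long long) m->size);
    /* No point feeding the contents to the checksum context. */
    mystreamer->skip_checksum = true;
}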

I would suggest abandoning this refactoring. There's very little logic
in verify_file_checksum() that you can actually reuse. I think you
should just duplicate the code. If you really want, you could arrange
to reuse the error-reporting code that checks for checksumlen !=
m->checksum_length and memcmp(checksumbuf, m->checksum_payload,
checksumlen) != 0, but even that I think is little enough that it's
fine to just duplicate it. The rest is either (1) OS calls like
open(), read(), etc. which won't be applicable to the
read-from-archive case or (2) calls to pg_checksum_WHATEVER, which are
fine to just duplicate, IMHO.

My eyes are starting to glaze over a bit here so expect comments below
this point to be only a partial review of the corresponding patch.

Regarding 0007, I think that the should_verify_sysid terminology is
problematic. I made all the code and identifier names talk only about
the control file, not the specific thing in the control file that we
are going to verify, in case in the future we want to verify
additional things. This breaks that abstraction.

Regarding 0009, I feel like astreamer_verify_content() might want to
grow some subroutines. One idea could be to move the
ASTREAMER_MEMBER_HEADER case and likewise ASTREAMER_MEMBER_CONTENTS
cases into a new function for each; another idea could be to move
smaller chunks of logic, e.g. under the ASTREAMER_MEMBER_CONTENTS
case, the verify_checksums could be one subroutine and the ill-named
verify_sysid stuff could be another. I'm not certain exactly what's
best here, but some of this code is as deeply as six levels nested,
which is not such a terrible thing that nobody should ever do it, but
it is bad enough that we should at least look around for a better way.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_verifybackup: TAR format backup verification

From: Amul Sul

On Tue, Jul 30, 2024 at 9:04 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jul 22, 2024 at 7:53 AM Amul Sul <sulamul@gmail.com> wrote:
> > Fixed in the attached version.
>
> First of all, in the interest of full disclosure, I suggested this
> project to Amul, so I'm +1 on the concept. I think making more of our
> backup-related tools work with tar and compressed tar formats -- and
> perhaps eventually data not stored locally -- will make them a lot
> more usable. If, for example, you take a full backup and an
> incremental backup, each in tar format, store them in the cloud
> someplace, and then want to verify and afterwards restore the
> incremental backup, you would need to download the tar files from the
> cloud, then extract all the tar files, then run pg_verifybackup and
> pg_combinebackup over the results. With this patch set, and similar
> work for pg_combinebackup, you could skip the step where you need to
> extract the tar files, saving significant amounts of time and disk
> space. If the tools also had the ability to access data remotely, you
> could save even more, but that's a much harder project, so it makes
> sense to me to start with this.
>
> Second, I think this patch set is quite well-organized and easy to
> read. That's not to say there is nothing in these patches to which
> someone might object, but it seems to me that it should at least be
> simple for anyone who wants to review to find the things to which they
> object in the patch set without having to spend too much time on it,
> which is fantastic.
>
> Third, I think the general approach that these patches take to the
> problem - namely, renaming bbstreamer to astreamer and moving it
> somewhere that permits it to be reused - makes a lot of sense. To be
> honest, I don't think I had it in mind that bbstreamer would be a
> reusable component when I wrote it, or if I did have it in mind, it
> was off in some dusty corner of my mind that doesn't get visited very
> often. I was imagining that you would need to build new infrastructure
> to deal with reading the tar file, but I think what you've done here
> is better. Reusing the bbstreamer stuff gives you tar file parsing,
> and decompression if necessary, basically for free, and IMHO the
> result looks rather elegant.
>

Thank you so much for the summary and the review.

> However, I'm not very convinced by 0003. The handling between the
> meson and make-based build systems doesn't seem consistent. On the
> meson side, you just add the objects to the same list that contains
> all of the other files (but not in alphabetical order, which should be
> fixed). But on the make side, you for some reason invent a separate
> AOBJS list instead of just adding the files to OBJS. I don't think it
> should be necessary to treat these objects any differently from any
> other objects, so they should be able to just go in OBJS: but if it
> were necessary, then I feel like the meson side would need something
> similar.
>

Fixed -- I did that because it was part of a separate group in pg_basebackup.

> Also, I'm not so sure about this change to src/fe_utils/meson.build:
>
> -  dependencies: frontend_common_code,
> +  dependencies: [frontend_common_code, lz4, zlib, zstd],
>
> frontend_common_code already includes dependencies on zlib and zstd,
> so we probably don't need to add those again here. I checked the
> result of otool -L src/bin/pg_controldata/pg_controldata from the
> meson build directory, and I find that currently it links against libz
> and libzstd but not liblz4. However, if I either make this line say
> dependencies: [frontend_common_code, lz4] or if I just update
> frontend_common_code to include lz4, then it starts linking against
> liblz4 as well. I'm not entirely sure if there's any reason to do one
> or the other of those things, but I think I'd be inclined to make
> frontend_common_code just include lz4 since it already includes zlib
> and zstd anyway, and then you don't need this change.
>

Fixed -- frontend_common_code now includes lz4 as well.

> Alternatively, we could take the position that we need to avoid having
> random front-end tools that don't do anything with compression at all,
> like pg_controldata for example, to link with compression libraries at
> all. But then we'd need to rethink a bunch of things that have not
> much to do with this patch.
>

Noted. I might give it a try another day, unless someone else beats
me, perhaps in a separate thread.

> Regarding 0004, I would rather not move show_progress and
> skip_checksums to the new header file. I suppose I was a bit lazy in
> making these file-level global variables instead of passing them down
> using function arguments and/or a context object, but at least right
> now they're only global within a single file. Can we think of
> inserting a preparatory patch that moves these into verifier_context?
>

Done -- added a new patch as 0004, and the subsequent patch numbers
have been incremented accordingly.

> Regarding 0005, the comment /* Check whether there's an entry in the
> manifest hash. */ should move inside verify_manifest_entry, where
> manifest_files_lookup is called. The call to the new function
> verify_manifest_entry() needs its own, proper comment. Also, I think
> there's a null-pointer deference hazard here, because
> verify_manifest_entry() can return NULL but the "Validate the manifest
> system identifier" chunk assumes it isn't. I think you could hit this
> - and presumably seg fault - if pg_control is on disk but not in the
> manifest.  Seems like just adding an m != NULL test is easiest, but
> see also below comments about 0006.
>

Fixed -- I did the NULL check in the earlier 0007 patch, but it should
have been done in this patch.

> Regarding 0006, suppose that the member file within the tar archive is
> longer than expected. With the unpatched code, we'll feed all of the
> data to the checksum context, but then, after the read-loop
> terminates, we'll complain about the file being the wrong length. With
> the patched code, we'll complain about the checksum mismatch before
> returning from verify_content_checksum(). I think that's an unintended
> behavior change, and I think the new behavior is worse than the old
> behavior. But also, I think that in the case of a tar file, the
> desired behavior is quite different. In that case, we know the length
> of the file from the member header, so we can check whether the length
> is as expected before we read any of the data bytes. If we discover
> that the size is wrong, we can complain about that and need not feed
> the checksum bytes to the checksum context at all -- we can just skip
> them, which will be faster. That approach doesn't really make sense
> for a file, because even if we were to stat() the file before we
> started reading it, the length could theoretically change as we are
> reading it, if someone is concurrently modifying it, but for a tar
> file I think it does.
>

In the case of a file size mismatch, we never reach the point where
checksum calculation is performed, because verify_manifest_entry()
encounters an error and sets manifest_file->bad to true, which causes
checksum verification to be skipped. For that reason, I didn't include
the size check again in the checksum calculation part. This behavior
is the same for plain backups, but the additional file size check was
added as a precaution (per the comment in verify_file_checksum()),
possibly for the same reasons you mentioned.

I agree, changing the order of errors could create confusion.
Previously, a file size mismatch was a clear and appropriate error
that was reported before the checksum failure error.

However, this can be fixed by delaying the checksum calculation until
the expected amount of file content has been received. Specifically,
return from verify_content_checksum() if (*computed_len != m->size). If
the file size is incorrect, the checksum calculation won't be performed,
and the caller's read loop (in verify_file_checksum()) will exit at some
point and then report the size mismatch error.

> I would suggest abandoning this refactoring. There's very little logic
> in verify_file_checksum() that you can actually reuse. I think you
> should just duplicate the code. If you really want, you could arrange
> to reuse the error-reporting code that checks for checksumlen !=
> m->checksum_length and memcmp(checksumbuf, m->checksum_payload,
> checksumlen) != 0, but even that I think is little enough that it's
> fine to just duplicate it. The rest is either (1) OS calls like
> open(), read(), etc. which won't be applicable to the
> read-from-archive case or (2) calls to pg_checksum_WHATEVER, which are
> fine to just duplicate, IMHO.
>

I kept the refactoring as it is, fixing verify_content_checksum() as
mentioned in the previous paragraph. Please let me know if this fix
and the explanation make sense to you. I'm okay with abandoning this
refactoring patch if you think that's better.

> My eyes are starting to glaze over a bit here so expect comments below
> this point to be only a partial review of the corresponding patch.
>
> Regarding 0007, I think that the should_verify_sysid terminology is
> problematic. I made all the code and identifier names talk only about
> the control file, not the specific thing in the control file that we
> are going to verify, in case in the future we want to verify
> additional things. This breaks that abstraction.
>

Agreed, changed to should_verify_control_data.

> Regarding 0009, I feel like astreamer_verify_content() might want to
> grow some subroutines. One idea could be to move the
> ASTREAMER_MEMBER_HEADER case and likewise ASTREAMER_MEMBER_CONTENTS
> cases into a new function for each; another idea could be to move
> smaller chunks of logic, e.g. under the ASTREAMER_MEMBER_CONTENTS
> case, the verify_checksums could be one subroutine and the ill-named
> verify_sysid stuff could be another. I'm not certain exactly what's
> best here, but some of this code is as deeply as six levels nested,
> which is not such a terrible thing that nobody should ever do it, but
> it is bad enough that we should at least look around for a better way.
>

Okay, I added the verify_checksums() and verify_controldata()
functions to the astreamer_verify.c file. I also updated related
variables that were clashing with these function names:
verify_checksums has been renamed to verifyChecksums, and verify_sysid
has been renamed to verifyControlData.

Thanks again for the review comments. Please have a look at the
attached version.

Regards,
Amul

Attachments

Re: pg_verifybackup: TAR format backup verification

From: Robert Haas

On Wed, Jul 31, 2024 at 9:28 AM Amul Sul <sulamul@gmail.com> wrote:
> Fixed -- I did that because it was part of a separate group in pg_basebackup.

Well, that's because pg_basebackup builds multiple executables, and
these files needed to be linked with some but not others. It looks
like when Andres added meson support, instead of linking each object
file into the binaries that need it, he had it just build a static
library and link every executable to that. That puts the linker in
charge of sorting out which binaries need which files, instead of
having the makefile do it. In any case, this consideration doesn't
apply when we're putting the object files into a library, so there was
no need to preserve the separate makefile variable. I think this looks
good now.

> Fixed -- frontend_common_code now includes lz4 as well.

Cool. 0003 overall looks good to me now, unless Andres wants to object.

> Noted. I might give it a try another day, unless someone else beats
> me, perhaps in a separate thread.

Probably not too important, since nobody has complained.

> Done -- added a new patch as 0004, and the subsequent patch numbers
> have been incremented accordingly.

I think I would have made this pass context->show_progress to
progress_report() instead of the whole verifier_context, but that's an
arguable stylistic choice, so I'll defer to you if you prefer it the
way you have it. Other than that, this LGTM.

However, what is now 0005 does something rather evil. The commit
message claims that it's just rearranging code, and that's almost
entirely true, except that it also changes manifest_file's pathname
member to be char * instead of const char *. I do not think that is a
good idea, and I definitely do not think you should do it in a patch
that purports to just be doing code movement, and I even more
definitely think that you should not do it without even mentioning
that you did it, and why you did it.

> Fixed -- I did the NULL check in the earlier 0007 patch, but it should
> have been done in this patch.

This is now 0006. struct stat's st_size is of type off_t -- or maybe
ssize_t on some platforms? - not type size_t. I suggest making the
filesize argument use int64 as we do in some other places. size_t is,
I believe, defined to be the right width to hold the size of an object
in memory, not the size of a file on disk, so it isn't really relevant
here.

Other than that, my only comment on this patch is that I think I would
find it more natural to write the check in verify_backup_file() in a
different order: I'd put context->manifest->version != 1 && m != NULL
&& m->matched && !m->bad && strcmp() because (a) that way the most
expensive test is last and (b) it feels weird to think about whether
we have the right pathname if we don't even have a valid manifest
entry. But this is minor and just a stylistic preference, so it's also
OK as you have it if you prefer.

> I agree, changing the order of errors could create confusion.
> Previously, a file size mismatch was a clear and appropriate error
> that was reported before the checksum failure error.

In my opinion, this patch (currently 0007) creates a rather confusing
situation that I can't easily reason about. Post-patch,
verify_content_checksum() is a mix of two different things: it ends up
containing all of the logic that needs to be performed on every chunk
of bytes read from the file plus some but not all of the end-of-file
error-checks from verify_file_checksum(). That's really weird. I'm not
very convinced that the test for whether we've reached the end of the
file is 100% correct, but even if it is, the stuff before that point
is stuff that is supposed to happen many times and the stuff after
that is only supposed to happen once, and I don't see any good reason
to smush those two categories of things into a single function. Plus,
changing the order in which those end-of-file checks happen doesn't
seem like the right idea either: the current ordering is good the way
it is. Maybe you want to think of refactoring to create TWO new
functions, one to do the per-hunk work and a second to do the
end-of-file "is the checksum OK?" stuff, or maybe you can just open
code it, but I'm not willing to commit this the way it is.
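
To sketch the two-function shape I have in mind (names invented, details
elided):

/* Per-hunk work: runs once for every chunk of bytes read. */
static void
checksum_one_hunk(pg_checksum_context *checksum_ctx, uint8 *buffer, int len,
                  uint64 *computed_len)
{
    if (pg_checksum_update(checksum_ctx, buffer, len) < 0)
        pg_fatal("could not update checksum");
    *computed_len += len;
}

/* End-of-file work: runs exactly once, after all bytes have been read. */
static void
checksum_finish(verifier_context *context, manifest_file *m,
                pg_checksum_context *checksum_ctx, uint64 computed_len)
{
    uint8       checksumbuf[PG_CHECKSUM_MAX_LENGTH];
    int         checksumlen;

    if (computed_len != m->size)
    {
        report_backup_error(context,
                            "file \"%s\" should contain %llu bytes, but read %llu bytes",
                            m->pathname,
                            (unsigned long long) m->size,
                            (unsigned long long) computed_len);
        return;
    }

    checksumlen = pg_checksum_final(checksum_ctx, checksumbuf);
    if (checksumlen != m->checksum_length ||
        memcmp(checksumbuf, m->checksum_payload, checksumlen) != 0)
        report_backup_error(context,
                            "checksum mismatch for file \"%s\"", m->pathname);
}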

Regarding 0008, I don't really see a reason why the m != NULL
shouldn't also move inside should_verify_control_data(). Yeah, the
caller added in 0010 might not need the check, but it won't really
cost anything. Also, it seems to me that the logic in 0010 is actually
wrong. If m == NULL, we'll keep the values of verifyChecksums and
verifyControlData from the previous iteration, whereas what we should
do is make them both false. How about removing the if m == NULL guard
here and making both should_verify_checksum() and
should_verify_control_data() test m != NULL internally? Then it all
works out nicely, I think. Or alternatively you need an else clause
that resets both values to false when m == NULL.
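
Concretely, something like this (the first macro is roughly what we have
today; the second is just a sketch of what I'm imagining):

#define should_verify_checksum(m) \
    ((m) != NULL && (m)->matched && !(m)->bad && \
     (m)->checksum_type != CHECKSUM_TYPE_NONE)

#define should_verify_control_data(manifest, m) \
    ((manifest)->version != 1 && \
     (m) != NULL && (m)->matched && !(m)->bad && \
     strcmp((m)->pathname, "global/pg_control") == 0)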

> Okay, I added the verify_checksums() and verify_controldata()
> functions to the astreamer_verify.c file. I also updated related
> variables that were clashing with these function names:
> verify_checksums has been renamed to verifyChecksums, and verify_sysid
> has been renamed to verifyControlData.

Maybe think of doing something with the ASTREAMER_MEMBER_HEADER case also.

Out of time for today, will look again soon. I think the first few of
these are probably pretty much ready for commit already, and with a
little more adjustment they'll probably be ready up through about
0006.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_verifybackup: TAR format backup verification

From: Andres Freund

Hi,

On 2024-07-31 16:07:03 -0400, Robert Haas wrote:
> On Wed, Jul 31, 2024 at 9:28 AM Amul Sul <sulamul@gmail.com> wrote:
> > Fixed -- I did that because it was part of a separate group in pg_basebackup.
> 
> Well, that's because pg_basebackup builds multiple executables, and
> these files needed to be linked with some but not others. It looks
> like when Andres added meson support, instead of linking each object
> file into the binaries that need it, he had it just build a static
> library and link every executable to that. That puts the linker in
> charge of sorting out which binaries need which files, instead of
> having the makefile do it.

Right. Meson supports using the same file with different compilation flags,
depending on the context it's used in (i.e., as part of an executable or a
shared library). But that also ends up compiling files multiple times when
using the same file in multiple binaries, which wasn't desirable here -> hence
moving it to a static lib.


> > Fixed -- frontend_common_code now includes lz4 as well.
> 
> Cool. 0003 overall looks good to me now, unless Andres wants to object.

Nope.

Greetings,

Andres Freund



Re: pg_verifybackup: TAR format backup verification

From: Amul Sul

On Thu, Aug 1, 2024 at 1:37 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jul 31, 2024 at 9:28 AM Amul Sul <sulamul@gmail.com> wrote:
> > Fixed -- I did that because it was part of a separate group in pg_basebackup.
>
> Well, that's because pg_basebackup builds multiple executables, and
> these files needed to be linked with some but not others. It looks
> like when Andres added meson support, instead of linking each object
> file into the binaries that need it, he had it just build a static
> library and link every executable to that. That puts the linker in
> charge of sorting out which binaries need which files, instead of
> having the makefile do it. In any case, this consideration doesn't
> apply when we're putting the object files into a library, so there was
> no need to preserve the separate makefile variable. I think this looks
> good now.
>

Understood.

> > Fixed -- frontend_common_code now includes lz4 as well.
>
> Cool. 0003 overall looks good to me now, unless Andres wants to object.
>
> > Noted. I might give it a try another day, unless someone else beats
> > me, perhaps in a separate thread.
>
> Probably not too important, since nobody has complained.
>
> > Done -- added a new patch as 0004, and the subsequent patch numbers
> > have been incremented accordingly.
>
> I think I would have made this pass context->show_progress to
> progress_report() instead of the whole verifier_context, but that's an
> arguable stylistic choice, so I'll defer to you if you prefer it the
> way you have it. Other than that, this LGTM.
>

Additionally, I moved total_size and done_size to verifier_context
because done_size needs to be accessed in astreamer_verify.c.
With  this change, verifier_context is now more suitable.

> However, what is now 0005 does something rather evil. The commit
> message claims that it's just rearranging code, and that's almost
> entirely true, except that it also changes manifest_file's pathname
> member to be char * instead of const char *. I do not think that is a
> good idea, and I definitely do not think you should do it in a patch
> that purports to just be doing code movement, and I even more
> definitely think that you should not do it without even mentioning
> that you did it, and why you did it.
>

True, that was a mistake on my part during the rebase. Fixed in the
attached version.

> > Fixed -- I did the NULL check in the earlier 0007 patch, but it should
> > have been done in this patch.
>
> This is now 0006. struct stat's st_size is of type off_t -- or maybe
> ssize_t on some platforms? - not type size_t. I suggest making the
> filesize argument use int64 as we do in some other places. size_t is,
> I believe, defined to be the right width to hold the size of an object
> in memory, not the size of a file on disk, so it isn't really relevant
> here.
>

Ok, used int64.

> Other than that, my only comment on this patch is that I think I would
> find it more natural to write the check in verify_backup_file() in a
> different order: I'd put context->manifest->version != 1 && m != NULL
> && m->matched && !m->bad && strcmp() because (a) that way the most
> expensive test is last and (b) it feels weird to think about whether
> we have the right pathname if we don't even have a valid manifest
> entry. But this is minor and just a stylistic preference, so it's also
> OK as you have it if you prefer.
>

I used to do it that way (a) -- keeping the expensive check for last.
I did the same thing while adding should_verify_control_data() in the
later patch. Somehow I missed it here; maybe I didn't pay enough
attention to this patch :(

> > I agree, changing the order of errors could create confusion.
> > Previously, a file size mismatch was a clear and appropriate error
> > that was reported before the checksum failure error.
>
> In my opinion, this patch (currently 0007) creates a rather confusing
> situation that I can't easily reason about. Post-patch,
> verify_content_checksum() is a mix of two different things: it ends up
> containing all of the logic that needs to be performed on every chunk
> of bytes read from the file plus some but not all of the end-of-file
> error-checks from verify_file_checksum(). That's really weird. I'm not
> very convinced that the test for whether we've reached the end of the
> file is 100% correct, but even if it is, the stuff before that point
> is stuff that is supposed to happen many times and the stuff after
> that is only supposed to happen once, and I don't see any good reason
> to smush those two categories of things into a single function. Plus,
> changing the order in which those end-of-file checks happen doesn't
> seem like the right idea either: the current ordering is good the way
> it is. Maybe you want to think of refactoring to create TWO new
> functions, one to do the per-hunk work and a second to do the
> end-of-file "is the checksum OK?" stuff, or maybe you can just open
> code it, but I'm not willing to commit this the way it is.
>

Understood. At the start of working on the v3 review, I thought of
completely discarding the 0007 patch and copying most of
verify_file_checksum() to a new function in astreamer_verify.c.
However, I later realized we could deduplicate some parts, so I split
verify_file_checksum() and moved the reusable part to a separate
function. Please have a look at v4-0007.

> Regarding 0008, I don't really see a reason why the m != NULL
> shouldn't also move inside should_verify_control_data(). Yeah, the
> caller added in 0010 might not need the check, but it won't really
> cost anything. Also, it seems to me that the logic in 0010 is actually
> wrong. If m == NULL, we'll keep the values of verifyChecksums and
> verifyControlData from the previous iteration, whereas what we should
> do is make them both false. How about removing the if m == NULL guard
> here and making both should_verify_checksum() and
> should_verify_control_data() test m != NULL internally? Then it all
> works out nicely, I think. Or alternatively you need an else clause
> that resets both values to false when m == NULL.
>

I had the same thought about checking for NULL inside
should_verify_control_data(), but I wanted to maintain the structure
similar to should_verify_checksum(). Making this change would have
also required altering should_verify_checksum(), I wasn’t sure if I
should make that change before. Now, I did that in the attached
version -- 0008 patch.

> > Okay, I added the verify_checksums() and verify_controldata()
> > functions to the astreamer_verify.c file. I also updated related
> > variables that were clashing with these function names:
> > verify_checksums has been renamed to verifyChecksums, and verify_sysid
> > has been renamed to verifyControlData.
>
> Maybe think of doing something with the ASTREAMER_MEMBER_HEADER case also.
>

Done.

> Out of time for today, will look again soon. I think the first few of
> these are probably pretty much ready for commit already, and with a
> little more adjustment they'll probably be ready up through about
> 0006.
>

Sure, thank you.

Regards,
Amul

Attachments

Re: pg_verifybackup: TAR format backup verification

From: Amul Sul

On Thu, Aug 1, 2024 at 6:48 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Thu, Aug 1, 2024 at 1:37 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Wed, Jul 31, 2024 at 9:28 AM Amul Sul <sulamul@gmail.com> wrote:
> > > Fixed -- I did that because it was part of a separate group in pg_basebackup.
> >
[...]
> > Out of time for today, will look again soon. I think the first few of
> > these are probably pretty much ready for commit already, and with a
> > little more adjustment they'll probably be ready up through about
> > 0006.
> >
>
> Sure, thank you.
>

The v4 version isn't handling the progress report correctly because
the total_size calculation was done in verify_manifest_entry(), and
done_size was updated during the checksum verification. This worked
well for the plain backup but failed for the tar backup, where
checksum verification occurs right after verify_manifest_entry(),
leading to incorrect total_size in the progress report output.

Additionally, the patch missed the final progress_report(true) call
for persistent output, which is called from verify_backup_checksums()
for the plain backup but never for tar backup verification. To address
this, I moved the first and last progress_report() calls to the main
function. Although this is a small change, I placed it in a separate
patch, 0009, in the attached version.

In addition to these changes, the attached version includes
improvements in code comments, function names, and their arrangements
in astreamer_verify.c.

Please consider the attached version for the review.

Regards,
Amul

Attachments

Re: pg_verifybackup: TAR format backup verification

From: Robert Haas

On Fri, Aug 2, 2024 at 7:43 AM Amul Sul <sulamul@gmail.com> wrote:
> Please consider the attached version for the review.

Thanks. I committed 0001-0003. The only thing that I changed was that
in 0001, you forgot to pgindent, which actually mattered quite a bit,
because astreamer is one character shorter than bbstreamer.

Before we proceed with the rest of this patch series, I think we
should fix up the comments for some of the astreamer files. Proposed
patch for that attached; please review.

I also noticed that cfbot was unhappy about this patch set:

[10:37:55.075] pg_verifybackup.c:100:7: error: no previous extern
declaration for non-static variable 'format'
[-Werror,-Wmissing-variable-declarations]
[10:37:55.075] char            format = '\0';          /* p(lain)/t(ar) */
[10:37:55.075]                 ^
[10:37:55.075] pg_verifybackup.c:100:1: note: declare 'static' if the
variable is not intended to be used outside of this translation unit
[10:37:55.075] char            format = '\0';          /* p(lain)/t(ar) */
[10:37:55.075] ^
[10:37:55.075] pg_verifybackup.c:101:23: error: no previous extern
declaration for non-static variable 'compress_algorithm'
[-Werror,-Wmissing-variable-declarations]
[10:37:55.075] pg_compress_algorithm compress_algorithm = PG_COMPRESSION_NONE;
[10:37:55.075]                       ^
[10:37:55.075] pg_verifybackup.c:101:1: note: declare 'static' if the
variable is not intended to be used outside of this translation unit
[10:37:55.075] pg_compress_algorithm compress_algorithm = PG_COMPRESSION_NONE;
[10:37:55.075] ^
[10:37:55.075] 2 errors generated.

Please fix and, after posting future versions of the patch set, try to
remember to check http://cfbot.cputube.org/amul-sul.html

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments

Re: pg_verifybackup: TAR format backup verification

From: Amul Sul

On Mon, Aug 5, 2024 at 10:29 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Aug 2, 2024 at 7:43 AM Amul Sul <sulamul@gmail.com> wrote:
> > Please consider the attached version for the review.
>
> Thanks. I committed 0001-0003. The only thing that I changed was that
> in 0001, you forgot to pgindent, which actually mattered quite a bit,
> because astreamer is one character shorter than bbstreamer.
>

Understood. Thanks for tidying up and committing the patches.

> Before we proceed with the rest of this patch series, I think we
> should fix up the comments for some of the astreamer files. Proposed
> patch for that attached; please review.
>

Looks good to me, except for the following typo that I have fixed in
the attached version:

s/astreamer_plan_writer/astreamer_plain_writer/

> I also noticed that cfbot was unhappy about this patch set:
>
> [10:37:55.075] pg_verifybackup.c:100:7: error: no previous extern
> declaration for non-static variable 'format'
> [-Werror,-Wmissing-variable-declarations]
> [10:37:55.075] char            format = '\0';          /* p(lain)/t(ar) */
> [10:37:55.075]                 ^
> [10:37:55.075] pg_verifybackup.c:100:1: note: declare 'static' if the
> variable is not intended to be used outside of this translation unit
> [10:37:55.075] char            format = '\0';          /* p(lain)/t(ar) */
> [10:37:55.075] ^
> [10:37:55.075] pg_verifybackup.c:101:23: error: no previous extern
> declaration for non-static variable 'compress_algorithm'
> [-Werror,-Wmissing-variable-declarations]
> [10:37:55.075] pg_compress_algorithm compress_algorithm = PG_COMPRESSION_NONE;
> [10:37:55.075]                       ^
> [10:37:55.075] pg_verifybackup.c:101:1: note: declare 'static' if the
> variable is not intended to be used outside of this translation unit
> [10:37:55.075] pg_compress_algorithm compress_algorithm = PG_COMPRESSION_NONE;
> [10:37:55.075] ^
> [10:37:55.075] 2 errors generated.
>

Fixed in the attached version.

> Please fix and, after posting future versions of the patch set, try to
> remember to check http://cfbot.cputube.org/amul-sul.html

Sure. I used to rely on that earlier, but after enabling Cirrus CI in the
GitHub repo, I assumed the workflow would be the same as cfbot's and
started overlooking it. However, cfbot reported a warning that didn't
appear in my GitHub run. From now on, I'll make sure to check cfbot as
well.

Regards,
Amul

Attachments

Re: pg_verifybackup: TAR format backup verification

From: Robert Haas

On Thu, Aug 1, 2024 at 9:19 AM Amul Sul <sulamul@gmail.com> wrote:
> > I think I would have made this pass context->show_progress to
> > progress_report() instead of the whole verifier_context, but that's an
> > arguable stylistic choice, so I'll defer to you if you prefer it the
> > way you have it. Other than that, this LGTM.
>
> Additionally, I moved total_size and done_size to verifier_context
> because done_size needs to be accessed in astreamer_verify.c.
> With  this change, verifier_context is now more suitable.

But it seems like 0006 now changes the logic for computing total_size.
Prepatch, the condition is:

-       if (context->show_progress && !context->skip_checksums &&
-               should_verify_checksum(m))
-               context->total_size += m->size;

where should_verify_checksum(m) checks (((m)->matched) && !((m)->bad)
&& (((m)->checksum_type) != CHECKSUM_TYPE_NONE)). But post-patch the
condition is:

+       if (!context.skip_checksums)
...
+               if (!should_ignore_relpath(context, m->pathname))
+                       total_size += m->size;

The old logic was reached from verify_backup_directory() which does
check should_ignore_relpath(), so the new condition hasn't added
anything. But it seems to have lost the show_progress condition, and
the m->checksum_type != CHECKSUM_TYPE_NONE condition. I think this
means that we'll sum the sizes even when not displaying progress, and
that if some of the files in the manifest had no checksums, our
progress reporting would compute wrong percentages after the patch.
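
In other words, wherever the computation ends up living, I'd expect it to
keep something like the old guard:

if (context->show_progress && !context->skip_checksums &&
    should_verify_checksum(m) &&
    !should_ignore_relpath(context, m->pathname))
    context->total_size += m->size;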

> Understood. At the start of working on the v3 review, I thought of
> completely discarding the 0007 patch and copying most of
> verify_file_checksum() to a new function in astreamer_verify.c.
> However, I later realized we could deduplicate some parts, so I split
> verify_file_checksum() and moved the reusable part to a separate
> function. Please have a look at v4-0007.

Yeah, that seems OK.

The fact that these patches don't have commit messages is making life
more difficult for me than it needs to be. In particular, I'm looking
at 0009 and there's no hint about why you want to do this. In fact
that's the case for all of these refactoring patches. Instead of
saying something like "tar format verification will want to verify the
control file, but will not be able to read the file directly from
disk, so separate the logic that reads the control file from the logic
that verifies it" you just say what code you moved. Then I have to
guess why you moved it, or flip back and forth between the refactoring
patch and 0011 to try to figure it out. It would be nice if each of
these refactoring patches contained a clear indication about the
purpose of the refactoring in the commit message.

> I had the same thought about checking for NULL inside
> should_verify_control_data(), but I wanted to maintain the structure
> similar to should_verify_checksum(). Making this change would have
> also required altering should_verify_checksum(), I wasn’t sure if I
> should make that change before. Now, I did that in the attached
> version -- 0008 patch.

I believe there is no reason for this change to be part of 0008 at
all, and that this should be part of whatever later patch needs it.

> > Maybe think of doing something with the ASTREAMER_MEMBER_HEADER case also.
>
> Done.

OK, the formatting of 0011 looks much better now.

It seems to me that 0011 is arranging to palloc the checksum context
for every file and then pfree it at the end. It seems like it would be
considerably more efficient if astreamer_verify contained a
pg_checksum_context instead of a pointer to a pg_checksum_context. If
you need a flag to indicate whether we've reinitialized the checksum
for the current file, it's better to add that than to have all of
these unnecessary allocate/free cycles.
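
That is, something like this (member names are purely illustrative):

typedef struct astreamer_verify
{
    astreamer   base;
    verifier_context *context;
    char       *archive_name;
    Oid         tblspc_oid;

    /* Per-member state, reset in the ASTREAMER_MEMBER_HEADER case. */
    manifest_file *mfile;
    bool        verify_checksums;
    pg_checksum_context checksum_ctx;   /* embedded, not palloc'd per file */
    uint64      checksum_bytes;
} astreamer_verify;

with pg_checksum_init(&mystreamer->checksum_ctx, m->checksum_type) called
from the member-header case whenever we decide to verify that member.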

Existing astreamer code uses struct member names_like_this. For the
new one, you mostly used namesLikeThis except when you used
names_like_this or namesLkThs.

It seems to me that instead of adding a global variable
verify_backup_file_cb, it would be better to move the 'format'
variable into verifier_context. Then you can do something like if
(context->format == 'p') verify_plain_backup_file() else
verify_tar_backup_file().

It's pretty common for .tar.gz to be abbreviated to .tgz. I think we
should support that.

Let's suppose that I have a backup which, for some reason, does not
use the same compression for all files (base.tar, 16384.tgz,
16385.tar.gz, 16366.tar.lz4). With this patch, that will fail. Now,
that's not really a problem, because having a backup with mixed
compression algorithms like that is strange and you probably wouldn't
try to do it. But on the other hand, it looks to me like making the
code support that would be more elegant than what you have now.
Because, right now, you have code to detect what type of backup you've
got by looking at base.WHATEVER_EXTENSION ... but then you have to
also have code that complains if some later file doesn't have the same
extension. But you could just detect the type of every file
individually.

In fact, I wonder if we even need -Z. What value is that actually
providing? Why not just always auto-detect?
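
Detecting each file individually could be as simple as this (sketch only;
the helper name is invented):

static bool
detect_tar_compression(const char *filename, pg_compress_algorithm *algorithm)
{
    if (pg_str_endswith(filename, ".tar"))
        *algorithm = PG_COMPRESSION_NONE;
    else if (pg_str_endswith(filename, ".tar.gz") ||
             pg_str_endswith(filename, ".tgz"))
        *algorithm = PG_COMPRESSION_GZIP;
    else if (pg_str_endswith(filename, ".tar.lz4"))
        *algorithm = PG_COMPRESSION_LZ4;
    else if (pg_str_endswith(filename, ".tar.zst"))
        *algorithm = PG_COMPRESSION_ZSTD;
    else
        return false;           /* not a tar file we recognize */
    return true;
}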

find_backup_format() ignores the possibility of stat() throwing an
error. That's not good.
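
At a minimum I'd expect something along these lines (sketch):

if (stat(path, &st) == 0)
    format = 'p';
else if (errno == ENOENT)
    format = 't';               /* no PG_VERSION, assume tar format */
else
    report_fatal_error("could not stat file \"%s\": %m", path);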

Suppose that the backup directory contains main.tar, 16385.tar, and
snuffleupagus.tar. It looks to me like what will happen here is that
we'll verify main.tar with tblspc_oid = InvalidOid, 16385.tar with
tblspc_oid = 16385, and snuffleupagus.tar with tblspc_oid =
InvalidOid. That doesn't sound right. I think we should either
completely ignore snuffleupagus.tar just as if it were completely
imaginary, or perhaps there's an argument for emitting a warning
saying that we weren't expecting a snuffleupagus to exist.

In general, I think all unexpected files in a tar-format backup
directory should get the same treatment, regardless of whether the
problem is with the extension or the file itself. We should either
silently ignore everything that isn't expected to be present, or we
should emit a complaint saying that the file isn't expected to be
present. Right now, you say that it's "not a valid file" if the
extension isn't what you expect (which doesn't seem like a good error
message, because the file may be perfectly valid for what it is, it's
just not a file we're expecting to see) and say nothing if the
extension is right but the part of the filename preceding the
extension is unexpected.

A related issue is that it's a little unclear what --ignore is
supposed to do for tar-format backups. Does that ignore files in the
backup directory, or files inside of the tar files in the
backup directory? If we decide that --ignore ignores files in the
backup directory, then we should complain about any unexpected files
that are present there unless they've been ignored. If we decide that
--ignore ignores files inside of the tar files, then I suggest we just
silently skip any files in the backup directory that don't seem to
have file names in the correct format. I think I prefer the latter
approach, but I'm not 100% sure what's best.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_verifybackup: TAR format backup verification

From: Amul Sul

On Tue, Aug 6, 2024 at 10:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Aug 1, 2024 at 9:19 AM Amul Sul <sulamul@gmail.com> wrote:
> > > I think I would have made this pass context->show_progress to
> > > progress_report() instead of the whole verifier_context, but that's an
> > > arguable stylistic choice, so I'll defer to you if you prefer it the
> > > way you have it. Other than that, this LGTM.
> >
> > Additionally, I moved total_size and done_size to verifier_context
> > because done_size needs to be accessed in astreamer_verify.c.
> > With  this change, verifier_context is now more suitable.
>
> But it seems like 0006 now changes the logic for computing total_size.
> Prepatch, the condition is:
>
> -       if (context->show_progress && !context->skip_checksums &&
> -               should_verify_checksum(m))
> -               context->total_size += m->size;
>
> where should_verify_checksum(m) checks (((m)->matched) && !((m)->bad)
> && (((m)->checksum_type) != CHECKSUM_TYPE_NONE)). But post-patch the
> condition is:
>
> +       if (!context.skip_checksums)
> ...
> +               if (!should_ignore_relpath(context, m->pathname))
> +                       total_size += m->size;
>
> The old logic was reached from verify_backup_directory() which does
> check should_ignore_relpath(), so the new condition hasn't added
> anything. But it seems to have lost the show_progress condition, and
> the m->checksum_type != CHECKSUM_TYPE_NONE condition. I think this
> means that we'll sum the sizes even when not displaying progress, and
> that if some of the files in the manifest had no checksums, our
> progress reporting would compute wrong percentages after the patch.
>

That is not true. The compute_total_size() function doesn't do
anything when progress isn't being displayed: its first if condition
returns early, the same way progress_report() does. I omitted
should_verify_checksum() since we don't have the matched and bad flag
information at the start, and we won't have that for TAR files at all.
However, I missed the checksum_type check, which is necessary, and
have added it now.

With the patch, I am concerned that we won't be able to give as
accurate a progress report as before. We add all the file sizes in the
backup manifest to total_size without checking whether the files exist
on disk. Therefore, the reported progress might sometimes not reach
100% when we encounter files where m->bad or m->matched == false at a
later stage. However, I think this should be acceptable, since there
will be an error for the respective missing or bad file, and it can be
understood that verification is complete even if the progress isn't
100% in that case. Thoughts/Comments?


> > Understood. At the start of working on the v3 review, I thought of
> > completely discarding the 0007 patch and copying most of
> > verify_file_checksum() to a new function in astreamer_verify.c.
> > However, I later realized we could deduplicate some parts, so I split
> > verify_file_checksum() and moved the reusable part to a separate
> > function. Please have a look at v4-0007.
>
> Yeah, that seems OK.
>
> The fact that these patches don't have commit messages is making life
> more difficult for me than it needs to be. In particular, I'm looking
> at 0009 and there's no hint about why you want to do this. In fact
> that's the case for all of these refactoring patches. Instead of
> saying something like "tar format verification will want to verify the
> control file, but will not be able to read the file directly from
> disk, so separate the logic that reads the control file from the logic
> that verifies it" you just say what code you moved. Then I have to
> guess why you moved it, or flip back and forth between the refactoring
> patch and 0011 to try to figure it out. It would be nice if each of
> these refactoring patches contained a clear indication about the
> purpose of the refactoring in the commit message.
>

Sorry, I was a bit lazy there, assuming you'd handle the review :).
I can understand the frustration -- added some description.

> > I had the same thought about checking for NULL inside
> > should_verify_control_data(), but I wanted to maintain the structure
> > similar to should_verify_checksum(). Making this change would have
> > also required altering should_verify_checksum(), I wasn’t sure if I
> > should make that change before. Now, I did that in the attached
> > version -- 0008 patch.
>
> I believe there is no reason for this change to be part of 0008 at
> all, and that this should be part of whatever later patch needs it.
>

Ok

> > > Maybe think of doing something with the ASTREAMER_MEMBER_HEADER case also.
> >
> > Done.
>
> OK, the formatting of 0011 looks much better now.
>
> It seems to me that 0011 is arranging to palloc the checksum context
> for every file and then pfree it at the end. It seems like it would be
> considerably more efficient if astreamer_verify contained a
> pg_checksum_context instead of a pointer to a pg_checksum_context. If
> you need a flag to indicate whether we've reinitialized the checksum
> for the current file, it's better to add that than to have all of
> these unnecessary allocate/free cycles.
>

I tried in the attached version, and it’s a good improvement. We don’t
need any flags; we can allocate that during astreamer creation. Later,
in the ASTREAMER_MEMBER_HEADER case while reading, we can
(re)initialize the context for each file as needed.
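
In other words, something like this sketch -- the fake_* types are
stand-ins for the real pg_checksum_context and astreamer_verify, so it
only illustrates the embed-by-value idea, not the patch code:

#include <string.h>

typedef struct
{
    int           type;
    unsigned char state[128];
} fake_checksum_context;        /* stand-in for pg_checksum_context */

typedef struct
{
    /* context embedded by value: no per-file palloc/pfree needed */
    fake_checksum_context checksum_ctx;
} fake_astreamer_verify;

/* Stand-in for the work done when a new member header is seen. */
static void
begin_member(fake_astreamer_verify *st, int checksum_type)
{
    memset(&st->checksum_ctx, 0, sizeof(st->checksum_ctx));
    st->checksum_ctx.type = checksum_type;  /* (re)initialize per file */
}

int
main(void)
{
    fake_astreamer_verify st;

    begin_member(&st, 1);
    begin_member(&st, 2);       /* reused for the next member, no free/alloc */
    return 0;
}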

> Existing astreamer code uses struct member names_like_this. For the
> new one, you mostly used namesLikeThis except when you used
> names_like_this or namesLkThs.
>

Yeah, in my patch, I ended up using the same name for both the
variable and the function. To avoid that, I made this change. This
could be a minor inconvenience for someone using ctags/cscope to find
the definition of the function or variable, as they might be directed
to the wrong place. However, I think it’s still okay since there are
ways to find the correct definition. I reverted those changes in the
attached version.

> It seems to me that instead of adding a global variable
> verify_backup_file_cb, it would be better to move the 'format'
> variable into verifier_context. Then you can do something like if
> (context->format == 'p') verify_plain_backup_file() else
> verify_tar_backup_file().
>

Done.

> It's pretty common for .tar.gz to be abbreviated to .tgz. I think we
> should support that.
>

Done.

> Let's suppose that I have a backup which, for some reason, does not
> use the same compression for all files (base.tar, 16384.tgz,
> 16385.tar.gz, 16366.tar.lz4). With this patch, that will fail. Now,
> that's not really a problem, because having a backup with mixed
> compression algorithms like that is strange and you probably wouldn't
> try to do it. But on the other hand, it looks to me like making the
> code support that would be more elegant than what you have now.
> Because, right now, you have code to detect what type of backup you've
> got by looking at base.WHATEVER_EXTENSION ... but then you have to
> also have code that complains if some later file doesn't have the same
> extension. But you could just detect the type of every file
> individually.
>
> In fact, I wonder if we even need -Z. What value is that actually
> providing? Why not just always auto-detect?
>

+1, removed the -Z option.

> find_backup_format() ignores the possibility of stat() throwing an
> error. That's not good.
>

I wasn't sure about that before -- I tried it in the attached version.
See if it looks good to you.

> Suppose that the backup directory contains main.tar, 16385.tar, and
> snuffleupagus.tar. It looks to me like what will happen here is that
> we'll verify main.tar with tblspc_oid = InvalidOid, 16385.tar with
> tblspc_oid = 16385, and snuffleupagus.tar with tblspc_oid =
> InvalidOid. That doesn't sound right. I think we should either
> completely ignore snuffleupagus.tar just as it were completely
> imaginary, or perhaps there's an argument for emitting a warning
> saying that we weren't expecting a snuffleupagus to exist.
>
> In general, I think all unexpected files in a tar-format backup
> directory should get the same treatment, regardless of whether the
> problem is with the extension or the file itself. We should either
> silently ignore everything that isn't expected to be present, or we
> should emit a complaint saying that the file isn't expected to be
> present. Right now, you say that it's "not a valid file" if the
> extension isn't what you expect (which doesn't seem like a good error
> message, because the file may be perfectly valid for what it is, it's
> just not a file we're expecting to see) and say nothing if the
> extension is right but the part of the filename preceding the
> extension is unexpected.
>

I added an error for files other than base.tar and
<tablespaceoid>.tar. I think the error message could be improved.

> A related issue is that it's a little unclear what --ignore is
> supposed to do for tar-format backups. Does that ignore files in the
> backup directory, or files inside of the tar files inside of the
> backup directory? If we decide that --ignore ignores files in the
> backup directory, then we should complain about any unexpected files
> that are present there unless they've been ignored. If we decide that
> --ignore ignores files inside of the tar files, then I suggest we just
> silently skip any files in the backup directory that don't seem to
> have file names in the correct format. I think I prefer the latter
> approach, but I'm not 100% sure what's best.
>

I am interested in making that feature as useful as possible --
I mean, allowing the option to ignore files from the backup directory
and from the archive files as well. I don't see any major drawbacks,
apart from spending extra CPU cycles to browse the ignore list.

Regards,
Amul

Attachments

Re: pg_verifybackup: TAR format backup verification

From
Robert Haas
Date:
[ I committed 0001, then noticed I had a typo in the subject line of
the commit message. Argh. ]

On Wed, Aug 7, 2024 at 9:41 AM Amul Sul <sulamul@gmail.com> wrote:
> With the patch, I am concerned that we won't be able to give an
> accurate progress report as before. We add all the file sizes in the
> backup manifest to the total_size without checking if they exist on
> disk. Therefore, sometimes the reported progress completion might not
> show 100% when we encounter files where m->bad or m->match == false at
> a later stage. However, I think this should be acceptable since there
> will be an error for the respective missing or bad file, and it can be
> understood that verification is complete even if the progress isn't
> 100% in that case. Thoughts/Comments?

When somebody says that something is a refactoring commit, my
assumption is that there should be no behavior change. If the behavior
is changing, it's not purely a refactoring, and it shouldn't be
labelled as a refactoring (or at least there should be a prominent
disclaimer identifying whatever behavior has changed, if a small
change was deemed acceptable and unavoidable).

I am very reluctant to accept a functional regression of the type that
you describe here (or the type that I postulated might occur, even if
I was wrong and it doesn't). The point here is that we're trying to
reuse the code, and I support that goal, because code reuse is good.
But it's not such a good thing that we should do it if it has negative
consequences. We should either figure out some other way of
refactoring it that doesn't have those negative side-effects, or we
should leave the existing code alone and have separate code for the
new stuff we want to do.

I do realize that the type of side effect you describe here is quite
minor. I could live with it if it were unavoidable. But I really don't
see why we can't avoid it.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_verifybackup: TAR format backup verification

From
Amul Sul
Date:
On Wed, Aug 7, 2024 at 9:12 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> [ I committed 0001, then noticed I had a typo in the subject line of
> the commit message. Argh. ]
>
> On Wed, Aug 7, 2024 at 9:41 AM Amul Sul <sulamul@gmail.com> wrote:
> > With the patch, I am concerned that we won't be able to give an
> > accurate progress report as before. We add all the file sizes in the
> > backup manifest to the total_size without checking if they exist on
> > disk. Therefore, sometimes the reported progress completion might not
> > show 100% when we encounter files where m->bad or m->match == false at
> > a later stage. However, I think this should be acceptable since there
> > will be an error for the respective missing or bad file, and it can be
> > understood that verification is complete even if the progress isn't
> > 100% in that case. Thoughts/Comments?
>
> When somebody says that something is a refactoring commit, my
> assumption is that there should be no behavior change. If the behavior
> is changing, it's not purely a refactoring, and it shouldn't be
> labelled as a refactoring (or at least there should be a prominent
> disclaimer identifying whatever behavior has changed, if a small
> change was deemed acceptable and unavoidable).
>

I agree; I'll be more careful next time.

> I am very reluctant to accept a functional regression of the type that
> you describe here (or the type that I postulated might occur, even if
> I was wrong and it doesn't). The point here is that we're trying to
> reuse the code, and I support that goal, because code reuse is good.
> But it's not such a good thing that we should do it if it has negative
> consequences. We should either figure out some other way of
> refactoring it that doesn't have those negative side-effects, or we
> should leave the existing code alone and have separate code for the
> new stuff we want to do.
>
> I do realize that the type of side effect you describe here is quite
> minor. I could live with it if it were unavoidable. But I really don't
> see why we can't avoid it.
>

The main issue I have is computing the total_size of valid files that
will be checksummed and that exist in both the manifest and the
backup, in the case of a tar backup. This cannot be done in the same
way as with a plain backup.

Another consideration is that, in the case of a tar backup, since
we're reading all tar files from the backup rather than individual
files to be checksummed, should we consider total_size as the sum of
all valid tar files in the backup, regardless of checksum
verification? If so, we would need to perform an initial pass to
calculate the total_size in the directory, similar to what
verify_backup_directory() does, but I am a bit uncomfortable doing
that and unsure if we should add such a pass.

An alternative idea is to provide progress reports per file instead of
for the entire backup directory. We could report the size of each file
and keep track of done_size as we read each tar header and content.
For example, the report could be:

109032/109032 kB (100%) base.tar file verified
123444/123444 kB (100%) 16245.tar file verified
23478/23478 kB (100%) 16246.tar file verified
.....
<total_done_size>/<total_size>  (NNN%) verified.

I think this type of reporting can be implemented with minimal
changes, and for plain backups, we can keep the reporting as it is.
Thoughts?

Regards,
Amul



Re: pg_verifybackup: TAR format backup verification

From
Robert Haas
Date:
On Wed, Aug 7, 2024 at 1:05 PM Amul Sul <sulamul@gmail.com> wrote:
> The main issue I have is computing the total_size of valid files that
> will be checksummed and that exist in both the manifests and the
> backup, in the case of a tar backup. This cannot be done in the same
> way as with a plain backup.

I think you should compute and sum the sizes of the tar files
themselves. Suppose you readdir(), make a list of files that look
relevant, and stat() each one. total_size is the sum of the file
sizes. Then you work your way through the list of files and read each
one. done_size is the total size of all files you've read completely
plus the number of bytes you've read from the current file so far.
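
To make that concrete, here is a rough, self-contained sketch of that
accounting scheme in plain C -- all names are illustrative stand-ins,
not code from the patch:

#include <sys/stat.h>
#include <dirent.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int
main(int argc, char **argv)
{
    const char *backup_dir = (argc > 1) ? argv[1] : ".";
    DIR        *dir;
    struct dirent *de;
    uint64_t    total_size = 0;
    uint64_t    done_size = 0;

    if ((dir = opendir(backup_dir)) == NULL)
        return 1;

    /* First pass: add up the on-disk sizes of the tar archives. */
    while ((de = readdir(dir)) != NULL)
    {
        char        path[4096];
        struct stat sb;

        if (strstr(de->d_name, ".tar") == NULL)
            continue;           /* not an archive we care about */

        snprintf(path, sizeof(path), "%s/%s", backup_dir, de->d_name);
        if (stat(path, &sb) == 0)
            total_size += (uint64_t) sb.st_size;
    }
    closedir(dir);

    /*
     * Second pass (not shown): read each archive in full; after every
     * read, done_size += bytes_just_read, and progress is reported as
     * done_size out of total_size.
     */
    printf("total to verify: %llu bytes (done so far: %llu)\n",
           (unsigned long long) total_size, (unsigned long long) done_size);
    return 0;
}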

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_verifybackup: TAR format backup verification

From
Amul Sul
Date:
On Wed, Aug 7, 2024 at 11:28 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Aug 7, 2024 at 1:05 PM Amul Sul <sulamul@gmail.com> wrote:
> > The main issue I have is computing the total_size of valid files that
> > will be checksummed and that exist in both the manifests and the
> > backup, in the case of a tar backup. This cannot be done in the same
> > way as with a plain backup.
>
> I think you should compute and sum the sizes of the tar files
> themselves. Suppose you readdir(), make a list of files that look
> relevant, and stat() each one. total_size is the sum of the file
> sizes. Then you work your way through the list of files and read each
> one. done_size is the total size of all files you've read completely
> plus the number of bytes you've read from the current file so far.
>

I tried this in the attached version and made a few additional changes
based on Sravan's off-list comments regarding function names and
descriptions.

Now, verification happens in two passes. The first pass simply
verifies the file names, determines their compression types, and
returns a list of valid tar files whose contents need to be verified
in the second pass. The second pass is called at the end of
verify_backup_directory() after all files in that directory have been
scanned. I named the functions for pass 1 and pass 2 as
verify_tar_file_name() and verify_tar_file_contents(), respectively.
The rest of the code flow is similar to the previous version.

In the attached patch set, I abandoned the changes touching the
progress reporting code of plain backups by dropping the previous 0009
patch. The new 0009 patch adds missing APIs to simple_list.c to
destroy SimplePtrList. The rest of the patch numbers remain unchanged.

Regards,
Amul

Attachments

Re: pg_verifybackup: TAR format backup verification

From
Robert Haas
Date:
On Mon, Aug 12, 2024 at 5:13 AM Amul Sul <sulamul@gmail.com> wrote:
> I tried this in the attached version and made a few additional changes
> based on Sravan's off-list comments regarding function names and
> descriptions.
>
> Now, verification happens in two passes. The first pass simply
> verifies the file names, determines their compression types, and
> returns a list of valid tar files whose contents need to be verified
> in the second pass. The second pass is called at the end of
> verify_backup_directory() after all files in that directory have been
> scanned. I named the functions for pass 1 and pass 2 as
> verify_tar_file_name() and verify_tar_file_contents(), respectively.
> The rest of the code flow is similar as in the previous version.
>
> In the attached patch set, I abandoned the changes, touching the
> progress reporting code of plain backups by dropping the previous 0009
> patch. The new 0009 patch adds missing APIs to simple_list.c to
> destroy SimplePtrList. The rest of the patch numbers remain unchanged.

I think you've entangled the code paths here for plain-format backup
and tar-format backup in a way that is not very nice. I suggest
refactoring things so that verify_backup_directory() is only called
for plain-format backups, and you have some completely separate
function (perhaps verify_tar_backup) that is called for tar-format
backups. I don't think verify_backup_file() should be shared between
tar-format and plain-format backups either. Let that be just for
plain-format backups, and have separate logic for tar-format backups.
Right now you've got "if" statements in various places trying to get
all the cases correct, but I think you've missed some (and there's
also the issue of updating all the comments).

For instance, verify_backup_file() recurses into subdirectories, but
that behavior is inappropriate for a tar format backup, where
subdirectories should instead be treated like stray files: complain
that they exist. verify_backup_file() does this:

    /* If it's a directory, just recurse. */
    if (S_ISDIR(sb.st_mode))
    {
        verify_backup_directory(context, relpath, fullpath);
        return;
    }

    /* If it's not a directory, it should be a plain file. */
    if (!S_ISREG(sb.st_mode))
    {
        report_backup_error(context,
                            "\"%s\" is not a file or directory",
                            relpath);
        return;
    }

For a tar format backup, this whole thing should be just:

    /* In a tar format backup, we expect only plain files. */
    if (!S_ISREG(sb.st_mode))
    {
        report_backup_error(context,
                            "\"%s\" is not a plain file",
                            relpath);
        return;
    }

Also, immediately above, you do
simple_string_list_append(&context->ignore_list, relpath), but that is
pointless in the tar-file case, and arguably wrong, if -i is going to
ignore both pathnames in the base directory and also pathnames inside
the tar files, because we could add something to the ignore list here
-- accomplishing nothing useful -- and then that ignore-list entry
could cause us to disregard a stray file with the same name present
inside one of the tar files -- which is silly. Note that the actual
point of this logic is to make sure that if we can't stat() a certain
directory, we don't go off and issue a million complaints about all
the files in that directory being missing. But this doesn't accomplish
that goal for a tar-format backup. For a tar-format backup, you'd want
to figure out which files in the manifest we don't expect to see based
on this file being inaccessible, and then arrange to suppress future
complaints about all of those files. But you can't implement that
here, because you haven't parsed the file name yet. That happens
later, in verify_tar_file_name().

You could add a whole bunch more if statements here and try to work
around these issues, but it seems pretty obviously a dead end to me.
Almost the entire function is going to end up being inside of an
if-statement. Essentially the only thing in verify_backup_file() that
should actually be the same in the plain and tar-format cases is that
you should call stat() either way and check whether it throws an
error. But that is not enough justification for trying to share the
whole function.

I find the logic in verify_tar_file_name() to be somewhat tortured as
well. The strstr() calls will match those strings anywhere in the
filename, not just at the end. But also, even if you fixed that, why
work backward from the end of the filename toward the beginning? It
would seem like it would make more sense to (1) first check if the string
starts with "base" and set suffix equal to pathname+4, (2) if not,
strtol(pathname, &suffix, 10) and complain if we didn't eat at least
one character or got 0 or something too big to be an OID, (3) check
whether suffix is .tar, .tar.gz, etc.
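
A hedged sketch of that forward-parsing order, with stand-in names
(MY_OID_MAX substitutes for OID_MAX, and the suffix list here is only
illustrative), might look like:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MY_OID_MAX  4294967295UL    /* stand-in for OID_MAX */

static int
classify_tar_name(const char *name, unsigned long *tblspc_oid)
{
    const char *suffix;

    *tblspc_oid = 0;

    if (strncmp(name, "base", 4) == 0)
        suffix = name + 4;
    else
    {
        char       *endp;
        unsigned long num = strtoul(name, &endp, 10);

        /* must consume at least one digit and be a plausible OID */
        if (endp == name || num == 0 || num > MY_OID_MAX)
            return -1;
        *tblspc_oid = num;
        suffix = endp;
    }

    if (strcmp(suffix, ".tar") == 0 ||
        strcmp(suffix, ".tar.gz") == 0 || strcmp(suffix, ".tgz") == 0 ||
        strcmp(suffix, ".tar.lz4") == 0 || strcmp(suffix, ".tar.zst") == 0)
        return 0;

    return -1;                  /* unexpected file name */
}

int
main(void)
{
    unsigned long oid;

    printf("%d\n", classify_tar_name("base.tar.gz", &oid));        /* 0 */
    printf("%d\n", classify_tar_name("16385.tar", &oid));          /* 0, oid=16385 */
    printf("%d\n", classify_tar_name("snuffleupagus.tar", &oid));  /* -1 */
    return 0;
}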

In verify_member_checksum(), you set mystreamer->verify_checksum =
false. That would be correct if there could only ever be one call to
verify_member_checksum() per member file, but there is no such rule.
There can be (and, I think, normally will be) more than one
ASTREAMER_MEMBER_CONTENTS chunk. I'm a little confused as to how this
code passes any kind of testing.

Also in verify_member_checksum(), the mystreamer->received_bytes <
m->size check seems strange. I don't think this is the right way to do
something when you reach the end of an archive member. The right way
to do that is to do it when the ASTREAMER_MEMBER_TRAILER chunk shows
up.

In verify_member_control_data(), you use astreamer_buffer_until(). But
that's the same buffer that is being used by verify_member_checksum(),
so I don't really understand how you expect this to work. If this code
path were ever taken, verify_member_checksum() would see the same data
more than once.

The call to pg_log_debug() in this function also seems quite random.
In a plain-format backup, we'd actually be doing something different
for pg_controldata vs. other files, namely reading it during the
initial directory scan. But here we're reading the file in exactly the
same sense as we're reading every other file, neither more nor less,
so why mention this file and not all of the others? And why at this
exact spot in the code?

I suspect that the report_fatal_error("%s: could not read control
file: read %d of %zu", ...) call is unreachable. I agree that you need
to handle the case where the control file inside the tar file is not
the expected length, and in fact I think you should probably write a
TAP test for that exact scenario to make sure it works. I bet this
doesn't. Even if it did, the error message makes no sense in context.
In the plain-format backup, this error would come from code reading
the actual bytes off the disk -- i.e. the complaint about not being
able to read the control file would come from the read() system call.
Here it doesn't.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_verifybackup: TAR format backup verification

From
Robert Haas
Date:
On Wed, Aug 14, 2024 at 9:20 AM Amul Sul <sulamul@gmail.com> wrote:
> I agree with keeping verify_backup_file() separate, but I'm hesitant
> about doing the same for verify_backup_directory().

I don't have time today to go through your whole email or re-review
the code, but I plan to circle back to that at a later time. However,
I want to respond to this point in the meantime. There are two big
things that are different for a tar-format backup vs. a
directory-format backup as far as verify_backup_directory() is
concerned. One is that, for a directory format backup, we need to be
able to recurse down through subdirectories; for tar-format backups we
don't. So a version of this function that only handled tar-format
backups would be somewhat shorter and simpler, and would need one
fewer argument. The second difference is that for the tar-format
backup, you need to make a list of the files you see and then go back
and visit each one a second time, and for a directory-format backup
you don't need to do that. It seems to me that those differences are
significant enough to warrant having two separate functions. If you
unify them, I think that less than half of the resulting function is
going to be common to both cases. Yeah, a few bits of logic will be
duplicated, like the error handling for closedir(), the logic to skip
"." and "..", and the psprintf() to construct a full pathname for the
directory entry. But that isn't really very much code, and it's code
that is pretty straightforward and also present in various other
places in the PostgreSQL source tree, perhaps not in precisely the
same form. The fact that two functions both call readdir() and do
something with each file in the directory isn't enough to say that
they should be the same function, IMHO.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_verifybackup: TAR format backup verification

From
Robert Haas
Date:
On Wed, Aug 14, 2024 at 12:44 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 14, 2024 at 9:20 AM Amul Sul <sulamul@gmail.com> wrote:
> > I agree with keeping verify_backup_file() separate, but I'm hesitant
> > about doing the same for verify_backup_directory().
>
> I don't have time today to go through your whole email or re-review
> the code, but I plan to circle back to that at a later time, However,
> I want to respond to this point in the meanwhile.

I have committed 0004 (except that I changed a comment) and 0005
(except that I didn't move READ_CHUNK_SIZE).

Looking at the issue mentioned above again, I agree that the changes
in verify_backup_directory() don't look overly invasive in this
version. I'm still not 100% convinced it's the right
call, but it doesn't seem bad.

You have a spurious comment change to the header of verify_plain_backup_file().

I feel like the naming of tarFile and tarFiles is not consistent with
the overall style of this file.

I don't like this:

[robert.haas ~]$ pg_verifybackup btar
pg_verifybackup: error: pg_waldump does not support parsing WAL files
from a tar archive.
pg_verifybackup: hint: Try "pg_verifybackup --help" to skip parse WAL
files option.

The hint seems like it isn't really correct grammar, and I also don't
see why we can't be more specific. How about "You must use -n,
--no-parse-wal when verifying a tar-format backup."?

The primary message seems a bit redundant, because parsing WAL files
is the only thing pg_waldump does. How about "pg_waldump cannot read
from a tar archive"? Note that primary error messages should not end
in a period (while hint and detail messages should).

+        int64 num = strtoi64(relpath, &suffix, 10);



+        if (suffix == NULL || (num <= 0) || (num > OID_MAX))

Seems like too many parentheses.



--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_verifybackup: TAR format backup verification

From
Robert Haas
Date:
On Fri, Aug 16, 2024 at 3:53 PM Robert Haas <robertmhaas@gmail.com> wrote:
> +        int64 num = strtoi64(relpath, &suffix, 10);

Hit send too early. Here, seems like this should be strtoul(), not strtoi64().

The documentation of --format seems to be cut-and-pasted from
pg_basebackup and the language isn't really appropriate here. e.g.
"The main data directory's contents will be written to a file
named..." but pg_verifybackup writes nothing.

+    simple_string_list_append(&context.ignore_list, "pg_wal.tar");
+    simple_string_list_append(&context.ignore_list, "pg_wal.tar.gz");
+    simple_string_list_append(&context.ignore_list, "pg_wal.tar.lz4");
+    simple_string_list_append(&context.ignore_list, "pg_wal.tar.zst");

Why not make the same logic that recognizes base or an OID also
recognize pg_wal as a prefix, and identify that as the WAL archive?
For now we'll have to skip it, but if you do it that way then if we
add future support for more suffixes, it'll just work, whereas this
way won't. And you'd need that code anyway if we ever can run
pg_waldump on a tarfile, because you would need to identify the
compression method. Note that the danger of the list of suffixes
getting out of sync here is not hypothetical: you added .tgz elsewhere
but not here.

There's probably more to look at here but I'm running out of energy for today.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_verifybackup: TAR format backup verification

From
Robert Haas
Date:
On Wed, Aug 21, 2024 at 7:08 AM Amul Sul <sulamul@gmail.com> wrote:
> I have reworked a few comments, revised error messages, and made some
> minor tweaks in the attached version.

Thanks.

> Additionally, I would like to discuss a couple of concerns regarding
> error placement and function names to gather your suggestions.
>
> 0007 patch: Regarding error placement:
>
> 1. I'm a bit unsure about the (bytes_read != m->size) check that I
> placed in verify_checksum() and whether it's in the right place. Per
> our previous discussion, this check is applicable to plain backup
> files since they can change while being read, but not for files
> belonging to tar backups. For consistency, I included the check for
> tar backups as well, as it doesn't cause any harm. Is it okay to keep
> this check in verify_checksum(), or should I move it back to
> verify_file_checksum() and apply it only to the plain backup format?

I think it's a good sanity check. For a long time I thought it was
triggerable until I eventually realized that you just get this message
if the file size is wrong:

pg_verifybackup: error: "pg_xact/0000" has size 8203 on disk but size
8192 in the manifest

After realizing that, I agree with you that this doesn't really seem
reachable for tar backups, but I don't think it hurts anything either.

While I was investigating this question, I discovered this problem:

$ pg_basebackup -cfast -Ft -Dx
$ pg_verifybackup -n x
backup successfully verified
$ mkdir x/tmpdir
$ tar -C x/tmpdir -xf x/base.tar
$ rm x/base.tar
$ tar -C x/tmpdir -cf x/base.tar .
$ pg_verifybackup -n x
<lots of errors>

It appears that the reason why this fails is that the paths in the
original base.tar from the server do not include "./" at the
beginning, and the ones that I get when I create my own tarfile have
that. But that shouldn't matter. Anyway, I was able to work around it
like this:

$ tar -C x/tmpdir -cf x/base.tar `(cd x/tmpdir; echo *)`

Then the result verifies. But I feel like we should have some test
cases that do this kind of stuff so that there is automated
verification. In fact, the current patch seems to have no negative
test cases at all. I think we should test all the cases in
003_corruption.pl with tar format backups as well as with plain format
backups, which we could do by untarring one of the archives, messing
something up, and then retarring it. I also think we should have some
negative test case specific to tar-format backup. I suggest running a
coverage analysis and trying to craft test cases that hit as much of
the code as possible. There will probably be some errors you can't
hit, but you should try to hit the ones you can.

> 2. For the verify_checksum() function, I kept the argument name as
> bytes_read. Should we rename it to something more meaningful like
> computed_bytes, computed_size, or checksum_computed_size?

I think it's fine the way you have it.

> 0011 patch: Regarding function names:
>
> 1. named the function verify_tar_backup_file() to align with
> verify_plain_backup_file(), but it does not perform the complete
> verification as verify_plain_backup_file does. Not sure if it is the
> right name.

I was thinking of something like precheck_tar_backup_file().

> 2. verify_tar_file_contents() is the second and final part of tar
> backup verification. Should its name be aligned with
> verify_tar_backup_file()? I’m unsure what the best name would be.
> Perhaps verify_tar_backup_file_final(), but then
> verify_tar_backup_file() would need to be renamed to something like
> verify_tar_backup_file_initial(), which might be too lengthy.

verify_tar_file_contents() actually verifies the contents of all the
tar files, not just one, so the name is a bit confusing. Maybe
verify_all_tar_files().

> 3. verify_tar_contents() is the core of verify_tar_file_contents()
> that handles the actual verification. I’m unsure about the current
> naming. Should we rename it to something like
> verify_tar_contents_core()? It wouldn’t be an issue if we renamed
> verify_tar_file_contents() as pointed in point #2.

verify_one_tar_file()?

But with those renames, I think you really start to see why I'm not
very comfortable with verify_backup_directory(). The tar and plain
format cases aren't really doing the same thing. We're just gluing
them into a single function anyway.

I am also still uncomfortable with the way you've refactored some of
this so that we end up with very small amounts of code far away from
other code that they influence. Like you end up with this:

        /* Check the backup manifest entry for this file. */
        m = verify_manifest_entry(context, relpath, sb.st_size);

        /* Validate the pg_control information */
        if (should_verify_control_data(context->manifest, m))
...
        if (show_progress && !context->skip_checksums &&
                should_verify_checksum(m))

But verify_manifest_entry can return NULL or it can set m->bad and
either of those change the result of should_verify_control_data() and
should_verify_checksum(), but none of that is obvious when you just
look at this. Admittedly, the code in master isn't brilliant in terms
of making it obvious what's happening either, but I think this is
worse. I'm not really sure what I think we should do about that yet,
but I'm uncomfortable with it.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_verifybackup: TAR format backup verification

From
Robert Haas
Date:
I would rather that you didn't add simple_ptr_list_destroy_deep()
given that you don't need it for this patch series.

+
"\"%s\" unexpected file in the tar format backup",

This doesn't seem grammatical to me. Perhaps change this to: file
\"%s\" is not expected in a tar format backup

+       /* We are only interested in files that are not in the ignore list. */
+       if (member->is_directory || member->is_link ||
+               should_ignore_relpath(mystreamer->context, member->pathname))
+               return;

Doesn't this need to happen after we add pg_tblspc/$OID to the path,
rather than before? I bet this doesn't work correctly for files in
user-defined tablespaces, compared to the way it works for a
directory-format backup.

+               char            temp[MAXPGPATH];
+
+               /* Copy original name at temporary space */
+               memcpy(temp, member->pathname, MAXPGPATH);
+
+               snprintf(member->pathname, MAXPGPATH, "%s/%d/%s",
+                                "pg_tblspc", mystreamer->tblspc_oid, temp);

I don't like this at all. This function doesn't have any business
modifying the astreamer_member, and it doesn't need to. It can just do
char *pathname; char tmppathbuf[MAXPGPATH] and then set pathname to
either member->pathname or tmppathbuf depending on
OidIsValid(tblspcoid). Also, shouldn't this be using %u, not %d?
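
Roughly, the suggestion amounts to something like the following
sketch, where fake_member stands in for the real astreamer_member and
the OID handling and example path are purely illustrative:

#include <stdio.h>

#define MAXPGPATH 1024

typedef struct
{
    char        pathname[MAXPGPATH];
} fake_member;                  /* stand-in for astreamer_member */

static void
report_member_path(const fake_member *member, unsigned int tblspc_oid)
{
    const char *pathname;
    char        tmppathbuf[MAXPGPATH];

    if (tblspc_oid != 0)        /* stand-in for OidIsValid(tblspc_oid) */
    {
        /* build the prefixed path locally; note %u, since OIDs are unsigned */
        snprintf(tmppathbuf, sizeof(tmppathbuf), "pg_tblspc/%u/%s",
                 tblspc_oid, member->pathname);
        pathname = tmppathbuf;
    }
    else
        pathname = member->pathname;

    printf("verifying %s\n", pathname); /* member->pathname stays untouched */
}

int
main(void)
{
    fake_member m = {"16390/16391"};

    report_member_path(&m, 16385);  /* pg_tblspc/16385/16390/16391 */
    report_member_path(&m, 0);      /* unchanged */
    return 0;
}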

+       mystreamer->mfile = (void *) m;

Either the cast to void * isn't necessary, or it indicates that
there's a type mismatch that should be fixed.

+        * We could have these checks while receiving contents. However, since
+        * contents are received in multiple iterations, this would result in
+        * these lengthy checks being performed multiple times. Instead, setting
+        * flags here and using them before proceeding with verification will be
+        * more efficient.

Seems unnecessary to explain this.

+       Assert(mystreamer->verify_checksum);
+
+       /* Should have came for the right file */
+       Assert(strcmp(member->pathname, m->pathname) == 0);
+
+       /*
+        * The checksum context should match the type noted in the backup
+        * manifest.
+        */
+       Assert(checksum_ctx->type == m->checksum_type);

What do you think about:

Assert(m != NULL && !m->bad);
Assert(checksum_ctx->type == m->checksum_type);
Assert(strcmp(member->pathname, m->pathname) == 0);

Or possibly change the first one to Assert(should_verify_checksum(m))?

+       memcpy(&control_file, streamer->bbs_buffer.data, sizeof(ControlFileData));

This probably doesn't really hurt anything, but it's a bit ugly. You
first use astreamer_buffer_until() to force the entire file into a
buffer. And then here, you copy the entire file into a second buffer
which is exactly the same except that it's guaranteed to be properly
aligned. It would be possible to include a ControlFileData in
astreamer_verify and copy the bytes directly into it (you'd need a
second astreamer_verify field for the number of bytes already copied
into that structure). I'm not 100% sure that's worth the code, but it
seems like it wouldn't take more than a few lines, so perhaps it is.
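
A minimal sketch of that alternative, using stand-in types rather than
the real ControlFileData/astreamer_verify definitions, could be:

#include <string.h>
#include <stddef.h>

typedef struct
{
    int         dummy_version;  /* stand-in for the real ControlFileData */
    char        payload[8184];
} fake_control_file_data;

typedef struct
{
    fake_control_file_data control_file;    /* aligned destination */
    size_t      control_file_bytes;         /* bytes copied so far */
} fake_streamer;

/* Stand-in for handling one content chunk of pg_control. */
static void
absorb_control_file_chunk(fake_streamer *st, const char *data, size_t len)
{
    size_t      remaining = sizeof(st->control_file) - st->control_file_bytes;
    size_t      ncopy = (len < remaining) ? len : remaining;

    memcpy(((char *) &st->control_file) + st->control_file_bytes, data, ncopy);
    st->control_file_bytes += ncopy;

    /*
     * Once control_file_bytes reaches sizeof(ControlFileData), the struct
     * can be validated in place; any leftover bytes indicate a
     * wrong-sized file.
     */
}

int
main(void)
{
    fake_streamer st = {0};
    char        chunk[512] = {0};

    absorb_control_file_chunk(&st, chunk, sizeof(chunk));
    absorb_control_file_chunk(&st, chunk, sizeof(chunk));
    return (st.control_file_bytes == 1024) ? 0 : 1;
}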

+/*
+ * Verify plain backup.
+ */
+static void
+verify_plain_backup(verifier_context *context)
+{
+       Assert(context->format == 'p');
+       verify_backup_directory(context, NULL, context->backup_directory);
+}
+

This seems like a waste of space.

+verify_tar_backup(verifier_context *context)

I like this a lot better now! I'm still not quite sure about the
decision to have the ignore list apply to both the backup directory
and the tar file contents -- but given the low participation on this
thread, I don't think we have much chance of getting feedback from
anyone else right now, so let's just do it the way you have it and we
can change it later if someone shows up to complain.

+verify_all_tar_files(verifier_context *context, SimplePtrList *tarfiles)

I think this code could be moved into its only caller instead of
having a separate function. And then if you do that, maybe
verify_one_tar_file could be renamed to just verify_tar_file. Or
perhaps that function could also be removed and just move the code
into the caller. It's not much code and not very deeply nested.
Similarly create_archive_verifier() could be considered for this
treatment. Maybe inlining all of these is too much and the result will
look messy, but I think we should at least try to combine some of
them.

...Robert



Re: pg_verifybackup: TAR format backup verification

From
Robert Haas
Date:
+                       pg_log_error("pg_waldump cannot read from a tar");

"tar" isn't normally used as a noun as you do here, so I think this
should say "pg_waldump cannot read tar files".

Technically, the position of this check could lead to an unnecessary
failure, if -n wasn't specified but pg_wal.tar(.whatever) also doesn't
exist on disk. But I think it's OK to ignore that case.

However, I also notice this change to the TAP tests in a few places:

-       [ 'pg_verifybackup', $tempdir ],
+       [ 'pg_verifybackup', '-Fp', $tempdir ],

It's not the end of the world to have to make a change like this, but
it seems easy to do better. Just postpone the call to
find_backup_format() until right after we call parse_manifest_file().
That also means postponing the check mentioned above until right after
that, but that's fine: after parse_manifest_file() and then
find_backup_format(), you can do if (!no_parse_wal && context.format
== 't') { bail out }.

+       if (stat(path, &sb) == 0)
+               result = 'p';
+       else
+       {
+               if (errno != ENOENT)
+               {
+                       pg_log_error("cannot determine backup format : %m");
+                       pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+                       exit(1);
+               }
+
+               /* Otherwise, it is assumed to be a tar format backup. */
+               result = 't';
+       }

This doesn't look good, for a few reasons:

1. It would be clearer to structure this as if (stat(...) == 0) result
= 'p'; else if (errno == ENOENT) result = 't'; else { report an error;
} instead of the way you have it.

2. "cannot determine backup format" is not an appropriate way to
report the failure of stat(). The appropriate message is "could not
stat file \"%s\"".

3. It is almost never correct to put a space before a colon in an error message.

4. The hint doesn't look helpful, or necessary. I think you can just
delete that.

Regarding both point #2 and point #4, I think we should ask ourselves
how stat() could realistically fail here. On my system (macOS), the
document failed modes for stat() are EACCES (i.e. permission denied),
EFAULT (i.e. we have a bug in pg_verifybackup), EIO (I/O Error), ELOOP
(symlink loop), ENAMETOOLONG, ENOENT, ENOTDIR, and EOVERFLOW. In none
of those cases does it seem likely that specifying the format manually
will help anything. Thus, suggesting that the user look at the help,
presumably to find --format, is unlikely to solve anything, and
telling them that the error happened while trying to determine the
backup format isn't really helping anything, either. What the user
needs to know is that it was stat() that failed, and the pathname for
which it failed. Then they need to sort out whatever problem is
causing them to get one of those really weird errors.
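
Putting points 1, 2, and 4 together, the detection routine could be
shaped roughly like this sketch -- standalone stand-ins replace the
real pg_verifybackup helpers, and it assumes the PG_VERSION-based
detection rule discussed earlier in the thread:

#include <sys/stat.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char
find_backup_format(const char *backup_directory)
{
    char        path[4096];
    struct stat sb;

    /* A plain-format backup has PG_VERSION at the top level. */
    snprintf(path, sizeof(path), "%s/PG_VERSION", backup_directory);

    if (stat(path, &sb) == 0)
        return 'p';
    else if (errno == ENOENT)
        return 't';             /* assume a tar-format backup */
    else
    {
        /* report the stat() failure itself, with the pathname */
        fprintf(stderr, "could not stat file \"%s\": %s\n",
                path, strerror(errno));
        exit(1);
    }
}

int
main(int argc, char **argv)
{
    printf("format: %c\n", find_backup_format(argc > 1 ? argv[1] : "."));
    return 0;
}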

Aside from the above, 0010 looks pretty reasonable, although I'll
probably want to do some wordsmithing on some of the comments at some
point.

-   of the backup.  The backup must be stored in the "plain"
-   format; a "tar" format backup can be checked after extracting it.
+   of the backup.  The backup must be stored in the "plain" or "tar"
+   format.  Verification is supported for <literal>gzip</literal>,
+   <literal>lz4</literal>, and  <literal>zstd</literal> compressed tar backup;
+   any other compressed format backups can be checked after decompressing them.

I don't think that we need to say that the backup must be stored in
the plain or tar format, because those are the only backup formats
pg_basebackup knows about. Similarly, it doesn't seem help to me to
enumerate all the compression algorithms that pg_basebackup supports
and say we only support those; what else would a user expect?

What I would do is replace the original sentence ("The backup must be
stored...") with: The backup may be stored either in the "plain" or
the "tar" format; this includes "tar" backups compressed with any
algorithm supported by pg_basebackup. However, at present, WAL
verification is supported only for plain-format backups. Therefore, if
the backup is stored in "tar" format, the <literal>-n,
--no-parse-wal</literal> option should be used.

+               # Add tar backup format option
+               push @backup, ('-F', 't');
+               # Add switch to skip WAL verification.
+               push @verify, ('-n');

Say why, not what. The second comment should say something like "WAL
verification not yet supported for tar-format backups".

+                       "$format backup fails with algorithm \"$algorithm\"");
+       $primary->command_ok(\@backup, "$format backup ok with algorithm \"$algorithm\"");
+               ok(-f "$backup_path/backup_manifest", "$format backup manifest exists");
+               "verify $format backup with algorithm \"$algorithm\"");

Personally I would change "$format" to "$format format" in all of
these places, so that we talk about a "tar format backup" or a "plain
format backup" instead of a "tar backup" or a "plain backup".

+               'skip_on_windows' => 1

I don't understand why 4 of the 7 new tests are skipped on Windows.
The existing "skip" message for this parameter says "unix-style
permissions not supported on Windows" but that doesn't seem applicable
for any of the new cases, and I couldn't find a comment about it,
either.

+       my @files = glob("*");
+       system_or_bail($tar, '-cf', "$backup_path/$archive", @files);

Why not just system_or_bail($tar, '-cf', "$backup_path/$archive", '.')?

Also, instead of having separate entries in the test array to do
basically the same thing on Windows, could we just iterate through the
test array twice and do everything once for plain format and then a
second time for tar format, and do the tests once for each? I don't
think that idea QUITE works, because the open_file_fails,
open_directory_fails, and search_directory_fails tests are really not
applicable to tar format. But we could rename skip_on_windows to
tests_file_permissions and skip those both on Windows and for tar
format. But aside from that, I don't quite see why it makes sense to,
for example, test extra_file for both formats but not
extra_tablespace_file, and indeed it seems like an important bit of
test coverage.

I also feel like we should have tests someplace that add extra files
to a tar-format backup in the backup directory (e.g. 1234567.tar, or
wug.tar, or 123456.wug) or remove entire files.

...Robert