Discussion: pg_waldump: support decoding of WAL inside tarfile
Hi All,

Attaching a patch to support a new feature that lets pg_waldump decode WAL files directly from a tar archive. This work addresses a limitation in pg_verifybackup[1], which couldn't parse WAL files from tar-formatted backups.

The implementation aligns with pg_waldump's existing xlogreader design, which uses three callback functions to manage WAL segments: open, read, and close. For tar archives, however, the approach is simpler. Instead of using separate callbacks for opening and closing, the tar archive is opened once at the start and closed explicitly at the end.

The core logic is in the WAL page reading callback. When xlogreader requests a new WAL page, this callback is invoked. It then calls the archive streamer routine to read the WAL data from the tar archive into a buffer. This data is then copied into xlogreader's own buffer, completing the read. Essentially, this is plumbing work: the new code is responsible for getting WAL data from the tar archive and feeding it to the existing xlogreader. All other WAL page and record decoding logic, which is already robust within xlogreader, is reused as is.

This feature is implemented as a series of patches:

- Refactoring: The first few patches (0001-0004) are dedicated to refactoring and minor code changes.
- 0005: This patch introduces the core functionality for pg_waldump to read WAL from a tar archive using the same archive streamer (fe_utils/astreamer.h) used in pg_verifybackup. This version requires WAL files in the archive to be in sequential order.
- 0006: This patch removes the sequential-order restriction. If pg_waldump encounters an out-of-order WAL file, it writes the file to a temporary directory. The utility then continues decoding and reads from this temporary location later.
- 0007 and onwards: These patches update pg_verifybackup to remove the restriction on WAL parsing for tar-formatted backups.
- 0008: This patch renames the "--wal-directory" switch to "--wal-path" to make it more generic, allowing it to accept either a directory path or a tar archive path.

-----------------------------------
Known Issues & Status:
-----------------------------------
- Timeline Switching: The current implementation in patch 0006 does not correctly handle timeline switching. This is a known issue, especially when a timeline change occurs on a WAL file that has been written to a temporary location.
- Testing: Local regression tests on CentOS and macOS M4 are passing. However, some tests on macOS Sonoma (specifically 008_untar.pl and 010_client_untar.pl) are failing in the GitHub workflow with a "WAL parsing failed for timeline 1" error. This issue is currently being investigated.

Please take a look at the attached patch and let me know your thoughts. This is an initial version, and I am making incremental improvements to address known issues and limitations.

[1] https://git.postgresql.org/pg/commitdiff/8dfd3129027969fdd2d9d294220c867d2efd84aa

--
Regards,
Amul Sul
EDB: http://www.enterprisedb.com
Attachments
- v1-0001-Refactor-pg_waldump-Move-some-declarations-to-new.patch
- v1-0002-Refactor-pg_waldump-Separate-logic-used-to-calcul.patch
- v1-0003-Refactor-pg_waldump-Restructure-TAP-tests.patch
- v1-0004-pg_waldump-Rename-directory-creation-routine-for-.patch
- v1-0005-pg_waldump-Add-support-for-archived-WAL-decoding.patch
- v1-0006-WIP-pg_waldump-Remove-the-restriction-on-the-orde.patch
- v1-0007-pg_verifybackup-Delay-default-WAL-directory-prepa.patch
- v1-0008-pg_verifybackup-Rename-the-wal-directory-switch-t.patch
- v1-0009-pg_verifybackup-enabled-WAL-parsing-for-tar-forma.patch
On Thu, Aug 7, 2025 at 7:47 PM Amul Sul <sulamul@gmail.com> wrote:
>
> [....]
>
> -----------------------------------
> Known Issues & Status:
> -----------------------------------
> - Timeline Switching: The current implementation in patch 006 does not
> correctly handle timeline switching. This is a known issue, especially
> when a timeline change occurs on a WAL file that has been written to a
> temporary location.
>

This is still pending and will be addressed in the next version. Therefore, patch 0006 remains marked as WIP.

> - Testing: Local regression tests on CentOS and macOS M4 are passing.
> However, some tests on macOS Sonoma (specifically 008_untar.pl and
> 010_client_untar.pl) are failing in the GitHub workflow with a "WAL
> parsing failed for timeline 1" error. This issue is currently being
> investigated.
>

This has been fixed in the attached version; all GitHub workflow tests are now fine.

Regards,
Amul
Attachments
- v2-0001-Refactor-pg_waldump-Move-some-declarations-to-new.patch
- v2-0002-Refactor-pg_waldump-Separate-logic-used-to-calcul.patch
- v2-0003-Refactor-pg_waldump-Restructure-TAP-tests.patch
- v2-0004-pg_waldump-Rename-directory-creation-routine-for-.patch
- v2-0005-pg_waldump-Add-support-for-archived-WAL-decoding.patch
- v2-0006-WIP-pg_waldump-Remove-the-restriction-on-the-orde.patch
- v2-0007-pg_verifybackup-Delay-default-WAL-directory-prepa.patch
- v2-0008-pg_verifybackup-Rename-the-wal-directory-switch-t.patch
- v2-0009-pg_verifybackup-enabled-WAL-parsing-for-tar-forma.patch
On Mon, Aug 25, 2025 at 5:58 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Thu, Aug 7, 2025 at 7:47 PM Amul Sul <sulamul@gmail.com> wrote:
> > [....]
> > -----------------------------------
> > Known Issues & Status:
> > -----------------------------------
> > - Timeline Switching: The current implementation in patch 006 does not
> > correctly handle timeline switching. This is a known issue, especially
> > when a timeline change occurs on a WAL file that has been written to a
> > temporary location.
> >
>
> This is still pending and will be addressed in the next version.
> Therefore, patch 0006 remains marked as WIP.
>

After testing pg_waldump, I have realized that my previous understanding of its timeline handling was incorrect. I had mistakenly assumed, from reading the xlogreader code, that it would use the same timeline-switching logic found in xlogreader, without first verifying this behavior. In testing, I found that pg_waldump does not follow timeline switches. Instead, it expects all WAL files to be from a single timeline, which is either specified by the user, determined from the starting segment, or defaults to 1. This is a positive finding, as it means we don't need to make significant changes to align with pg_waldump's current behavior.

The attached patches are now complete and no longer work in progress -- ready for review. Additionally, I've dropped patch v2-0004 because it is no longer necessary. The primary patches that implement the proposed feature are now 0004 and 0005 in the attached set.

Regards,
Amul
Attachments
- v3-0001-Refactor-pg_waldump-Move-some-declarations-to-new.patch
- v3-0002-Refactor-pg_waldump-Separate-logic-used-to-calcul.patch
- v3-0003-Refactor-pg_waldump-Restructure-TAP-tests.patch
- v3-0004-pg_waldump-Add-support-for-archived-WAL-decoding.patch
- v3-0005-pg_waldump-Remove-the-restriction-on-the-order-of.patch
- v3-0006-pg_verifybackup-Delay-default-WAL-directory-prepa.patch
- v3-0007-pg_verifybackup-Rename-the-wal-directory-switch-t.patch
- v3-0008-pg_verifybackup-enabled-WAL-parsing-for-tar-forma.patch
On Tue, Aug 26, 2025 at 1:53 PM Amul Sul <sulamul@gmail.com> wrote:
>
[..patch]
Hi Amul!
0001: LGTM, maybe I would just slightly enhance the commit message
("This is in preparation for adding a second source file to this
directory.") -- maybe a bit more verbose, or use a message from
0002?
0002: LGTM
0003: LGTM
Tested here (after a partial patch apply), and the test suite worked fine.
0004:
a. Why should it be necessary to provide startLSN (-s) ? Couldn't
it autodetect the first WAL (tar file) inside and just use that with
some info message?
$ /usr/pgsql19/bin/pg_waldump --path=/tmp/base/pg_wal.tar
pg_waldump: error: no start WAL location given
b. Why would it try to open a "blah" dir if I wanted the "blah"
segment from the archive? Shouldn't it say that it was looking in the
archive and couldn't find it inside?
$ /usr/pgsql19/bin/pg_waldump --path=/tmp/base/pg_wal.tar blah
pg_waldump: error: could not open file "blah": Not a directory
c. It doesn't work when using SEGSTART, even though it's there:
$ /usr/pgsql19/bin/pg_waldump --path=/tmp/base/pg_wal.tar
000000010000000000000059
pg_waldump: error: could not open file "000000010000000000000059":
Not a directory
$ tar tf /tmp/base/pg_wal.tar | head -1
000000010000000000000059
d. I've later noticed that follow-up patches seem to use the
-s switch, and there it seems to work OK. The above SEGSTART issue was
not detected, probably because the tests need to be extended to cover
segment names rather than just the --start LSN (see test_pg_waldump):
$ /usr/pgsql19/bin/pg_waldump --path=/tmp/base/pg_wal.tar --stats
-s 0/59000358
pg_waldump: first record is after 0/59000358, at 0/590003E8,
skipping over 144 bytes
WAL statistics between 0/590003E8 and 0/61000000:
[..]
e. Code around `if (walpath == NULL && directory != NULL)` needs
some comments.
f. Regarding the code around `if (fname != NULL && is_tar_file(fname,
&compression))`: if fname is a WAL segment here
(00000001000000000000005A), why do we check again whether it has been
tar-ed (is_tar_file())?
g. Just a question: the commit message says `Note that this patch
requires that the WAL files within the archive be in sequential order;
an error will be reported otherwise`. I'm wondering if such
occurrences are known to happen in the wild? Or is it just an
assumption that someone might modify the tar somehow? (Either way, we
could just add a reason why we need to handle such a case, if we know
one -- is manual alteration the only source of such a state?) For the
record, I've tested crafting custom archives with out-of-sequence WAL
files, and the code seems to work (it was done using: tar --append
-f pg_wal.tar --format=ustar ..)
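The tar --append approach mentioned above can be scripted into a small repro; this is a sketch for testing only, and the segment names and one-byte contents are placeholders (real segments are 16MB WAL files):

```shell
# Craft a tar archive whose WAL members are out of sequence, using the
# tar --append trick described above.  Names/sizes are placeholders.
cd "$(mktemp -d)"
printf 'x' > 00000001000000000000005A
printf 'x' > 000000010000000000000059
tar cf pg_wal.tar --format=ustar 00000001000000000000005A
tar --append -f pg_wal.tar --format=ustar 000000010000000000000059
tar tf pg_wal.tar   # lists 5A before 59, i.e. out of sequence
```

A pg_waldump run against such an archive is what exercises the sequential-order check (patch 0004) or the temporary-file reordering path (patch 0005).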
h. Anyway, in the case of a typo/wrong LSN, 0004 emits the wrong
error message, I think:
$ /usr/pgsql19/bin/pg_waldump --path=/tmp/base/pg_wal.tar --stats
-s 0/50000358
pg_waldump: error: WAL files are not archived in sequential order
pg_waldump: detail: Expecting segment number 80 but found 89.
It's just that the 50000358 LSN above is below the minimal LSN
present in the WAL segments (the first segment there is
000000010000000000000059; I intentionally provided the bad value 50..
as a typo, and it causes the wrong message). Now, it might not be an
issue, as with the 0005 patch the same test behaves OK (`pg_waldump:
error: could not find a valid record after 0/50000358`). It is just
relevant if this were committed not all at once.
i. If I give a wrong --timeline=999 to pg_waldump, it fails with a
misleading error message: could not read WAL data from "pg_wal.tar"
archive: read -1 of 8192
0005:
a. I'm wondering if we shouldn't log (to stderr?) some kind of
notification message (just once) that non-sequential WAL files were
discovered and that pg_waldump is starting to write to $somewhere, as
it may cause bigger I/O than anticipated when running the command.
This can easily help when troubleshooting why it is not fast; also,
TMPDIR is usually set to /tmp, which can be slow or too small.
b. IMHO member_prepare_tmp_write() / get_tmp_wal_file_path() with
TMPDIR can be prone to symlink attack. Consider setting TMPDIR=/tmp .
We are writing to e.g. /tmp/<WALsegment>.waldump.tmp in 0004 , but
that path is completely guessable. If an attacker prepares some
symlinks and links those to some other places, I think the code will
happily open and overwrite the contents of the rogue symlink. I think
using mkstemp(3)/tmpfile(3) would be a safer choice if TMPDIR needs to
be in play. Consider that pg_waldump can be run as root (there's no
mechanism preventing it from being used that way).
c. IMHO that unlink() might not be guaranteed to always remove the
files; in case of any trouble and exit(), those files might be left
over. I think we need some atexit() handlers. This can be triggered
with a combination of non-sequential files in the tar + a wrong LSN
given:
$ tar tf pg_wal.tar
00000001000000000000005A
00000001000000000000005B
00000001000000000000005C
[..]
000000010000000000000060
000000010000000000000059 <-- out of order, appended last
$ ls -lh 0*
ls: cannot access '0*': No such file or directory
$ /usr/pgsql19/bin/pg_waldump --path=/tmp/ble/pg_wal.tar --stats
-s 0/10000358 #wrong LSN
pg_waldump: error: could not find a valid record after 0/10000358
$ ls -lh 0*
-rw------- 1 postgres postgres 16M Sep 8 14:44
000000010000000000000059.waldump.tmp
-rw------- 1 postgres postgres 16M Sep 8 14:44
00000001000000000000005A.waldump.tmp
[..]
0006: LGTM
0007:
a. The commit message says `Future patches to pg_waldump will enable
it to decode WAL directly`, but those pg_waldump patches come earlier
in the series, right?
b. pg_verifybackup should print some info with --progress that it
is spawning pg_waldump (pg_verifybackup --progress mode does not
display anything related to verifying WALs, but it could).
c. I'm wondering: pg_waldump does not seem to complain if --end=LSN
is set so far in the future that it doesn't exist. E.g., if the
latest WAL segment is 60 (with end LSN 0/60A77A59), but I run
pg_waldump `--end=0/7000000`, it will return code 0 and nothing on
stderr. So how sure are we that the necessary WAL segments (as per
backup_manifest) are actually inside the tar? It's supposed to be
verified, but it isn't for this use case? The same happens if I craft
a special tar and remove just one WAL segment from pg_wal.tar (to
simulate a missing WAL segment), but ask pg_verifybackup/pg_waldump to
verify it up to the exact last LSN, e.g.:
$ /usr/pgsql19/bin/pg_waldump --quiet
--path=/tmp/missing/pg_wal.tar --timeline=1 --start=0/59000028
--end=0/60A77A58 && echo OK # but it is not OK
OK
$ /usr/pgsql19/bin/pg_waldump --stats
--path=/tmp/missing/pg_wal.tar --timeline=1 --start=0/59000028
--end=0/60A77A58
WAL statistics between 0/59000028 and 0/5CFFFFD0: # <-- 0/5C LSN
maximum detected
[..]
Notice it has read up to 0/5C (but I asked up to 0/60), because
I removed 5D:
$ tar tf /tmp/missing/pg_wal.tar| grep ^0
000000010000000000000059
00000001000000000000005A
00000001000000000000005B
00000001000000000000005C
00000001000000000000005E <-- missing 5D
Yet it reported no errors.
0008:
LGTM
Another open question I have is this: shouldn't backup_manifest come
with a CRC checksum for the archived WALs? Or is the guarantee that
the backup_manifest WAL-Ranges are present in pg_wal.tar good enough,
because the individual WAL files are CRC-protected themselves?
-J.
On Mon, Sep 8, 2025 at 7:07 PM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote:
>
> On Tue, Aug 26, 2025 at 1:53 PM Amul Sul <sulamul@gmail.com> wrote:
> >
> [..patch]
>
> Hi Amul!

Thanks for your review. I'm replying to a few of your comments now, but for the rest, I need to think about them. I'm broadly in agreement with some of them regarding the fix, but I won't be able to spend time on that next week due to official travel. I'll try to get back as soon as possible after that.

> a. Why should it be necessary to provide startLSN (-s)? Couldn't
> it autodetect the first WAL (tar file) inside and just use that with
> some info message?
>
> $ /usr/pgsql19/bin/pg_waldump --path=/tmp/base/pg_wal.tar
> pg_waldump: error: no start WAL location given

There are two reasons. First, the existing pg_waldump --path=some_directory would result in the same error. Second, it would force us to read the archive twice just to locate the first WAL segment, which is inefficient.

> c. It doesn't work when using SEGSTART, but it's there:
>
> $ /usr/pgsql19/bin/pg_waldump --path=/tmp/base/pg_wal.tar
> 000000010000000000000059
> pg_waldump: error: could not open file "000000010000000000000059":
> Not a directory
>
> $ tar tf /tmp/base/pg_wal.tar | head -1
> 000000010000000000000059

I don't believe this is the correct use case. The WAL files are inside a tar archive, and the requirement is to use a starting LSN and a timeline (if not the default).

> d. I've later noticed that follow-up patches seem to use the
> -s switch and there it seems to work OK. The above SEGSTART issue was
> not detected, probably because tests need to be extended to cover
> segment names rather than just --start LSN (see test_pg_waldump):
>
> $ /usr/pgsql19/bin/pg_waldump --path=/tmp/base/pg_wal.tar --stats
> -s 0/59000358
> pg_waldump: first record is after 0/59000358, at 0/590003E8,
> skipping over 144 bytes
> WAL statistics between 0/590003E8 and 0/61000000:
> [..]

Hope the previous reasoning makes sense to you.

> e. Code around `if (walpath == NULL && directory != NULL)` needs
> some comments.

I think this is an existing one.

> f. Code around `if (fname != NULL && is_tar_file(fname,
> &compression))`, so if fname is a WAL segment here
> (00000001000000000000005A), do we check again if that has been
> tar-ed (is_tar_file())? Why?

Again, how?

> g. Just a question: the commit message says `Note that this patch
> requires that the WAL files within the archive be in sequential order;
> an error will be reported otherwise`. I'm wondering if such
> occurrences are known to happen in the wild? Or is it just an
> assumption that someone might modify the tar somehow? (Either way, we
> could just add a reason why we need to handle such a case, if we know
> one -- is manual alteration the only source of such a state?) For the
> record, I've tested crafting custom archives with out-of-sequence WAL
> files, and the code seems to work (it was done using: tar --append
> -f pg_wal.tar --format=ustar ..)

This is an almost nonexistent occurrence. While pg_basebackup archives WAL files in sequential order, we don't have explicit code to enforce that order within it. Furthermore, since we can't control how external tools might handle the files, this extra precaution is necessary.

> Another open question I have is this: shouldn't backup_manifest come
> with a CRC checksum for the archived WALs? Or is the guarantee that
> the backup_manifest WAL-Ranges are present in pg_wal.tar good enough,
> because the individual WAL files are CRC-protected themselves?

I don't know, I have to check pg_verifybackup.

Regards,
Amul
Here are some review comments on v3-0004:
In general, I think this looks pretty nice, but I think it needs more
cleanup and polishing.
There doesn't seem to be any reason for
astreamer_waldump_content_new() to take an astreamer *next argument.
If you look at astreamer.h, you'll see that some astreamer_BLAH_new()
functions take such an argument, and others don't. The ones that do
forward their input to another astreamer; the ones that don't, like
astreamer_plain_writer_new(), send it somewhere else. AFAICT, this
astreamer is never going to send its output to another astreamer, so
there's no reason for this argument.
I'm also a little confused by the choice of the name
astreamer_waldump_content_new(). I would have thought this would be
something like astreamer_waldump_new() or astreamer_xlogreader_new().
The word "content" doesn't seem to me to be adding much here, and it
invites confusion with the "content" callback.
I think you can merge setup_astreamer() into
init_tar_archive_reader(). The only other caller is
verify_tar_archive(), but that does exactly the same additional steps
as init_tar_archive_reader(), as far as I can see.
The return statement for astreamer_wal_read is really odd:
+ return (count - nbytes) ? (count - nbytes) : -1;
Since 0 is false in C, this is equivalent to: count != nbytes ? count
- nbytes : -1, but it's a strange way to write it. What makes it even
stranger is that it seems as though the intention here is to count the
number of bytes read, but you do that by taking the number of bytes
requested (count) and subtracting the number of bytes we didn't manage
to read (nbytes); and then you just up and return -1 instead of 0
whenever the answer would have been zero. This is all lacking in
comments and seems a bit more confusing than it needs to be. So my
suggestions are:
1. Consider redefining nbytes to be the number of bytes that you have
read instead of the number of bytes you haven't read. So the loop in
this function would be while (nbytes < count) instead of while (nbytes
> 0).
2. If you need to map 0 to -1, consider having the caller do this
instead of putting that inside this function.
3. Add a comment saying what the return value is supposed to be.
If you do both 1 and 2, then the return statement can just say "return
nbytes;" and the comment can say "Returns the number of bytes
successfully read."
I would suggest changing the name of the variable from "readBuff" to
"readBuf". There are no existing uses of readBuff in the code base.
I think this comment also needs improvement:
+ /*
+ * Ignore existing data if the required target page
has not yet been
+ * read.
+ */
+ if (recptr >= endPtr)
+ {
+ len = 0;
+
+ /* Reset the buffer */
+ resetStringInfo(astreamer_buf);
+ }
This comment is problematic for a few reasons. First, we're not
ignoring the existing data: we're throwing it out. Second, the comment
doesn't say why we're doing what we're doing, only that we're doing
it. Here's my guess at the actual explanation -- please correct me if
I'm wrong: "pg_waldump never reads the same WAL bytes more than once,
so if we're now being asked for data beyond the end of what we've
already read, that means none of the data we currently have in the
buffer will ever be consulted again. So, we can discard the existing
buffer contents and start over." By the way, if this explanation is
correct, it might be nice to add an assertion someplace that verifies
it, like asserting that we're always reading from an LSN greater than
or equal to (or exactly equal to?) the LSN immediately following the
last data we read.
In general, I wonder whether there's a way to make the separation of
concerns between astreamer_wal_read() and TarWALDumpReadPage()
cleaner. Right now, the latter is basically a stub, but I'm not sure
that is the best thing here. I already mentioned one example of how to
do this: make the responsibility for 0 => -1 translation the job of
TarWALDumpReadPage() rather than astreamer_wal_read(). But I think
there might be a little more we can do. In particular, I wonder
whether we could say that astreamer_wal_read() is only responsible for
filling the buffer, and the caller, TarWALDumpReadPage() in this case,
needs to empty it. That seems like it might produce a cleaner
separation of duties.
Another thing that isn't so nice right now is that
verify_tar_archive() has to open and close the archive only for
init_tar_archive_reader() to be called to reopen it again just moments
later. It would be nicer to open the file just once and then keep it
open. Here again, I wonder if the separation of duties could be a bit
cleaner.
Is there a real need to pass XLogDumpPrivate to astreamer_wal_read or
astreamer_archive_read? The only things that they need are archive_fd,
archive_name, archive_streamer, archive_streamer_buf, and
archive_streamer_read_ptr. In other words, they really don't care
about any of the *existing* things that are in XLogDumpPrivate. This
makes me wonder whether we should actually try to make this new
astreamer completely independent of xlogreader. In other words,
instead of calling it astreamer_waldump() or astreamer_xlogreader() as
I proposed above, maybe it could be a completely generic astreamer,
say astreamer_stringinfo_new(StringInfo *buf) that just appends to the
buffer. That would require also moving the stuff out of
astreamer_wal_read() that knows about XLogRecPtr, but why does that
function need to know about XLogRecPtr? Couldn't the caller figure out
that part and just tell this function how many bytes are needed?
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Sep 12, 2025 at 2:28 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Is there a real need to pass XLogDumpPrivate to astreamer_wal_read or
> astreamer_archive_read? The only things that they need are archive_fd,
> archive_name, archive_streamer, archive_streamer_buf, and
> archive_streamer_read_ptr. In other words, they really don't care
> about any of the *existing* things that are in XLogDumpPrivate. This
> makes me wonder whether we should actually try to make this new
> astreamer completely independent of xlogreader. In other words,
> instead of calling it astreamer_waldump() or astreamer_xlogreader() as
> I proposed above, maybe it could be a completely generic astreamer,
> say astreamer_stringinfo_new(StringInfo *buf) that just appends to the
> buffer. That would require also moving the stuff out of
> astreamer_wal_read() that knows about XLogRecPtr, but why does that
> function need to know about XLogRecPtr? Couldn't the caller figure out
> that part and just tell this function how many bytes are needed?

Hmm, on further thought, I think this was a silly idea. Part of the intended function of this astreamer is to make sure we're only reading WAL files from the archive, and eventually reordering them if required, so obviously something completely generic isn't going to work. Maybe there's a way to make this look a little cleaner and tidier, but this isn't it...

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Sep 12, 2025 at 4:25 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Mon, Sep 8, 2025 at 7:07 PM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> >
> > On Tue, Aug 26, 2025 at 1:53 PM Amul Sul <sulamul@gmail.com> wrote:
> > >
> > [..patch]
> >
> > Hi Amul!
> >
>
> Thanks for your review. I'm replying to a few of your comments now,
> but for the rest, I need to think about them. I'm kind of in agreement
> with some of them for the fix, but I won't be able to spend time on
> that next week due to official travel. I'll try to get back as soon as
> possible after that.
>
Replying to the rest of the review comments:
> 0001: LGTM, maybe I would just slightly enhance the commit message
> ("This is in preparation for adding a second source file to this
> directory.") -- maye bit a bit more verbose or use a message from
> 0002?
Done.
> b. Why would it like to open "blah" dir if I wanted that "blah"
> segment from the archive? Shouldn't it tell that it was looking in the
> archive and couldn find it inside?
> $ /usr/pgsql19/bin/pg_waldump --path=/tmp/base/pg_wal.tar blah
> pg_waldump: error: could not open file "blah": Not a directory
Now, an error will be thrown if any additional command-line
arguments are provided when an archive is specified, similar to how
existing extra arguments are handled.
> i. If I give wrong --timeline=999 to pg_waldump it fails with
> misleading error message: could not read WAL data from "pg_wal.tar"
> archive: read -1 of 8192
Now added a much better error message for that case.
> a. I'm wondering if we shouldn't log (to stderr?) some kind of
> notification message (just once) that non-sequential WAL files were
> discovered and that pg_waldump is starting to write to $somewhere as
> it may be causing bigger I/O than anticipated when running the
> command. This can easily help when troubleshooting why it is not fast,
> and also having set TMPDIR to usually /tmp can be slow or too small.
Now emitting info messages, but I'm not sure whether they should be
info or debug.
> b. IMHO member_prepare_tmp_write() / get_tmp_wal_file_path() with
> TMPDIR can be prone to symlink attack. Consider setting TMPDIR=/tmp .
> We are writing to e.g. /tmp/<WALsegment>.waldump.tmp in 0004 , but
> that path is completely guessable. If an attacker prepares some
> symlinks and links those to some other places, I think the code will
> happily open and overwrite the contents of the rogue symlink. I think
> using mkstemp(3)/tmpfile(3) would be a safer choice if TMPDIR needs to
> be in play. Consider that pg_waldump can be run as root (there's no
> mechanism preventing it from being used that way).
I am not sure what the worst-case scenario would be or what a good
alternative is.
> c. IMHO that unlink() might be not guaranteed to always remove
> files, as in case of any trouble and exit() , those files might be
> left over. I think we need some atexit() handlers. This can be
> triggered with combo of options of nonsequential files in tar + wrong
> LSN given:
Done.
> 0007:
> a. Commit message says `Future patches to pg_waldump will enable
> it to decode WAL directly` , but those pg_waldump are earlier patches,
> right?
Right, fixed.
> b. pg_verifybackup should print some info with --progress that it
> is spawning pg_waldump (pg_verifybackup --progress mode does not
> display anything related to verifing WALs, but it could)
If we decide to do that, it could be a separate project, IMHO.
> c. I'm wondering, but pg_waldump seems to be not complaining if
> --end=LSN is made into such a future that it doesn't exist.
The behavior will be kept as if a directory was provided with a start
and end LSN.
Thanks again for the review. I'll post the new patches in my next reply.
Regards,
Amul
On Fri, Sep 12, 2025 at 11:58 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Here are some review comments on v3-0004:
>
Thanks for the review. My replies are below.
> There doesn't seem to be any reason for
> astreamer_waldump_content_new() to take an astreamer *next argument.
> If you look at astreamer.h, you'll see that some astreamer_BLAH_new()
> functions take such an argument, and others don't. The ones that do
> forward their input to another astreamer; the ones that don't, like
> astreamer_plain_writer_new(), send it somewhere else. AFAICT, this
> astreamer is never going to send its output to another astreamer, so
> there's no reason for this argument.
>
Done.
> I'm also a little confused by the choice of the name
> astreamer_waldump_content_new(). I would have thought this would be
> something like astreamer_waldump_new() or astreamer_xlogreader_new().
> The word "content" doesn't seem to me to be adding much here, and it
> invites confusion with the "content" callback.
>
Done -- renamed to astreamer_waldump_new().
> I think you can merge setup_astreamer() into
> init_tar_archive_reader(). The only other caller is
> verify_tar_archive(), but that does exactly the same additional steps
> as init_tar_archive_reader(), as far as I can see.
>
Done.
> The return statement for astreamer_wal_read is really odd:
>
> + return (count - nbytes) ? (count - nbytes) : -1;
>
Agreed, that's a bit odd. This seems to be leftover code from the experimental
patch. The astreamer_wal_read() function should behave like WALRead():
it should either successfully read all the requested bytes or throw an
error. Corrected in the attached version.
>
> I would suggest changing the name of the variable from "readBuff" to
> "readBuf". There are no existing uses of readBuff in the code base.
>
The existing WALDumpReadPage() function has a "readBuff" argument, and
I've used it that way for consistency.
> I think this comment also needs improvement:
>
> + /*
> + * Ignore existing data if the required target page
> has not yet been
> + * read.
> + */
> + if (recptr >= endPtr)
> + {
> + len = 0;
> +
> + /* Reset the buffer */
> + resetStringInfo(astreamer_buf);
> + }
>
> This comment is problematic for a few reasons. First, we're not
> ignoring the existing data: we're throwing it out. Second, the comment
> doesn't say why we're doing what we're doing, only that we're doing
> it. Here's my guess at the actual explanation -- please correct me if
> I'm wrong: "pg_waldump never reads the same WAL bytes more than once,
> so if we're now being asked for data beyond the end of what we've
> already read, that means none of the data we currently have in the
> buffer will ever be consulted again. So, we can discard the existing
> buffer contents and start over." By the way, if this explanation is
> correct, it might be nice to add an assertion someplace that verifies
> it, like asserting that we're always reading from an LSN greater than
> or equal to (or exactly equal to?) the LSN immediately following the
> last data we read.
>
Updated the comment. A similar assertion already exists right before
copying to the readBuff.
>
> Another thing that isn't so nice right now is that
> verify_tar_archive() has to open and close the archive only for
> init_tar_archive_reader() to be called to reopen it again just moments
> later. It would be nicer to open the file just once and then keep it
> open. Here again, I wonder if the separation of duties could be a bit
> cleaner.
>
Prefer to keep those separate, assuming that reopening the file won't
cause any significant harm. Let me know if you think otherwise.
Attached the updated version, kindly have a look.
Regards,
Amul
Attachments
- v4-0001-Refactor-pg_waldump-Move-some-declarations-to-new.patch
- v4-0002-Refactor-pg_waldump-Separate-logic-used-to-calcul.patch
- v4-0003-Refactor-pg_waldump-Restructure-TAP-tests.patch
- v4-0004-pg_waldump-Add-support-for-archived-WAL-decoding.patch
- v4-0005-pg_waldump-Remove-the-restriction-on-the-order-of.patch
- v4-0006-pg_verifybackup-Delay-default-WAL-directory-prepa.patch
- v4-0007-pg_verifybackup-Rename-the-wal-directory-switch-t.patch
- v4-0008-pg_verifybackup-enabled-WAL-parsing-for-tar-forma.patch
On Thu, Sep 25, 2025 at 4:25 AM Amul Sul <sulamul@gmail.com> wrote:
> > Another thing that isn't so nice right now is that
> > verify_tar_archive() has to open and close the archive only for
> > init_tar_archive_reader() to be called to reopen it again just moments
> > later. It would be nicer to open the file just once and then keep it
> > open. Here again, I wonder if the separation of duties could be a bit
> > cleaner.
>
> Prefer to keep those separate, assuming that reopening the file won't
> cause any significant harm. Let me know if you think otherwise.

Well, I guess I'd like to know why we can't do better. I'm not really
worried about performance, but reopening the file means that you can
never make it work with reading from a pipe.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Sep 29, 2025 at 8:45 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Sep 25, 2025 at 4:25 AM Amul Sul <sulamul@gmail.com> wrote:
> > > Another thing that isn't so nice right now is that
> > > verify_tar_archive() has to open and close the archive only for
> > > init_tar_archive_reader() to be called to reopen it again just moments
> > > later. It would be nicer to open the file just once and then keep it
> > > open. Here again, I wonder if the separation of duties could be a bit
> > > cleaner.
> >
> > Prefer to keep those separate, assuming that reopening the file won't
> > cause any significant harm. Let me know if you think otherwise.
>
> Well, I guess I'd like to know why we can't do better. I'm not really
> worried about performance, but reopening the file means that you can
> never make it work with reading from a pipe.

I have some skepticism regarding the extra coding that might be
introduced, as performance is not my primary concern here. If we aim
to keep the file open only once, that logic should be implemented
before calling verify_tar_archive(), not inside it. Implementing the
open and close logic within verify_tar_archive() and
free_tar_archive_reader() would create a confusing and scattered
pattern, especially since these separate operations require only two
lines of code each (open and close if it's a tar file).

My second concern is that after verify_tar_archive(), we might need to
reset the file reader offset to the beginning. While reusing the
buffered data from the first iteration is technically possible, that
only works if the desired start LSN is at the absolute beginning of
the archive, or later in the sequence, which cannot be reliably
guaranteed. Therefore, for simplicity and to avoid the complexity of
managing that offset reset code, I am thinking of a simpler approach.

Regards,
Amul
On Mon, Sep 29, 2025 at 12:17 PM Amul Sul <sulamul@gmail.com> wrote:
> While reusing the buffered data
> from the first iteration is technically possible, that only works if
> the desired start LSN is at the absolute beginning of the archive, or
> later in the sequence, which cannot be reliably guaranteed.
I spent a bunch of time studying this code today and I think that the
problem you're talking about here is evidence of a design problem with
astreamer_wal_read() and some of the other code in
astreamer_waldump.c. Your code calls astreamer_wal_read() when it
wants to peek at the first xlog block to determine the WAL segment
size, and it also calls astreamer_wal_read() when it wants to read WAL
sequentially beginning at the start LSN and continuing until it
reaches the end LSN. However, these two cases have very different
requirements. verify_tar_archive(), which is misleadingly named and
really exists to determine the WAL segment size, just wants to read
the first xlog block that physically appears in the archive. Every
xlog block will have the same WAL segment size, so it does not matter
which one we read. On the other hand, TarWALDumpReadPage wants to read
WAL in sequential order. In other words, one call to
astreamer_wal_read() really wants to read a block without any block
reordering, and the other call wants to read a block with block
reordering.
To me, it looks like the problem here is that the block reordering
functionality should live on top of the astreamer, not inside of it.
Imagine that astreamer just spits out the bytes in the order in which
they physically appear in the archive, and then there's another
component that consumes and reorders those bytes. So, you read data
and push it into the astreamer until the number of bytes in the output
buffer is at least XLOG_BLCKSZ, and then from there you extract the
WAL segment size. Then, you call XLogReaderAllocate() and enter the
main loop. The reordering logic lives inside of TarWALDumpReadPage().
Each time it gets data from the astreamer's buffer, it either returns
it to the caller if it's in order or buffers it using temporary files
if not.
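For illustration, here is a toy, self-contained sketch of the first half of that design (none of this is the real astreamer API -- the ByteStream type, the chunk size, and the header offset are all made up): bytes are consumed strictly in physical archive order, buffered until at least one xlog block is available, and only then is the segment size peeked at. No reordering happens at this layer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define XLOG_BLCKSZ 8192        /* default xlog block size */
#define SEG_SIZE_OFFSET 16      /* hypothetical header offset, illustration only */

/* Stand-in for a stream of bytes in physical archive order. */
typedef struct
{
	const uint8_t *data;
	size_t		len;
	size_t		pos;
} ByteStream;

/* Hypothetical chunked read; returns bytes copied, 0 at end of stream. */
static size_t
stream_read(ByteStream *s, uint8_t *out, size_t max)
{
	size_t		n = s->len - s->pos;

	if (n > max)
		n = max;
	memcpy(out, s->data + s->pos, n);
	s->pos += n;
	return n;
}

/*
 * Buffer bytes in physical order until we hold at least one xlog block,
 * then read the WAL segment size out of it.  Returns 0 on success, -1 if
 * the stream ends before a full block is available.
 */
static int
peek_wal_seg_size(ByteStream *s, uint8_t *buf, uint32_t *seg_size)
{
	size_t		have = 0;

	while (have < XLOG_BLCKSZ)
	{
		size_t		n = stream_read(s, buf + have, 1024);

		if (n == 0)
			return -1;			/* archive shorter than one block */
		have += n;
	}
	memcpy(seg_size, buf + SEG_SIZE_OFFSET, sizeof(uint32_t));
	return 0;
}
```

The second half -- reordering, and spilling out-of-order segments to temporary files -- would then live entirely above this, in TarWALDumpReadPage().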
I found it's actually quite easy to write a patch that avoids
reopening the file. Here it is, on top of your v4:
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 2c42df46d43..c4346a5e211 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -368,17 +368,8 @@ init_tar_archive_reader(XLogDumpPrivate *private,
const char *waldir,
XLogRecPtr startptr, XLogRecPtr endptr,
pg_compress_algorithm compression)
{
- int fd;
astreamer *streamer;
- /* Open tar archive and store its file descriptor */
- fd = open_file_in_directory(waldir, private->archive_name);
-
- if (fd < 0)
- pg_fatal("could not open file \"%s\"", private->archive_name);
-
- private->archive_fd = fd;
-
/*
* Create an appropriate chain of archive streamers for reading the given
* tar archive.
@@ -1416,12 +1407,22 @@ main(int argc, char **argv)
/* we have everything we need, start reading */
if (is_tar)
{
+ /* Open tar archive and store its file descriptor */
+ private.archive_fd =
+ open_file_in_directory(waldir, private.archive_name);
+ if (private.archive_fd < 0)
+ pg_fatal("could not open file \"%s\"", private.archive_name);
+
/* Verify that the archive contains valid WAL files */
waldir = waldir ? pg_strdup(waldir) : pg_strdup(".");
init_tar_archive_reader(&private, waldir, InvalidXLogRecPtr,
InvalidXLogRecPtr, compression);
verify_tar_archive(&private);
- free_tar_archive_reader(&private);
+ astreamer_free(private.archive_streamer);
+
+ if (lseek(private.archive_fd, 0, SEEK_SET) != 0)
+ pg_log_error("could not seek in file \"%s\": %m",
+ private.archive_name);
/* Set up for reading tar file */
init_tar_archive_reader(&private, waldir, private.startptr,
Of course, this is not really what we want to do: it avoids reopening
the file, but because we can't back up the archive streamer once it's
been created, we have to lseek back to the beginning of the file. But
notice how silly this looks: with this patch, we free the archive
reader and immediately create a new archive reader that is exactly the
same in every way except that we call astreamer_waldump_new(startptr,
endptr, private) instead of astreamer_waldump_new(InvalidXLogRecPtr,
InvalidXLogRecPtr, private). We could arrange to update the original
archive streamer with new values of startSegNo and endSegNo after
verify_tar_archive(), but that's still not quite good enough, because
we might have already made some decisions on what to do with the data
that we read that it's too late to reverse. But, what that means is
that the astreamer_waldump machinery is not smart enough to read one
block of data without making irreversible decisions from which we
can't recover without recreating the entire object. I think we can,
and should, try to do better.
It's also worth noting that the unfortunate layering doesn't just
require us to read the first block of the file: it also complicates
the code in various places. The fact that astreamer_wal_read() needs a
special case for XLogRecPtrIsInvalid(recptr) is a direct result of
this problem, and the READ_ANY_WAL() macro and both the places that
test it are also direct results of this problem. In other words, I'm
arguing that astreamer_wal_read() is incorrectly defined, and that
error creates ugliness in the code both above and below
astreamer_wal_read().
While I'm on the topic of astreamer_wal_read(), here are a few other
problems I noticed:
* The return value is not documented, and it seems to always be count,
in which case it might as well return void. The caller already has the
value they passed for count.
* It seems like it would be more appropriate to assert that endPtr >=
len and just set startPtr = endPtr - len. I don't see how len > endPtr
can ever happen, and I bet bad things will happen if it does.
* "pg_waldump never ask the same" -> "pg_waldump never asks for the same"
Also, this is absolutely not OK with me:
/* Fetch more data */
if (astreamer_archive_read(privateInfo) == 0)
{
char fname[MAXFNAMELEN];
XLogSegNo segno;
XLByteToSeg(targetPagePtr, segno, WalSegSz);
XLogFileName(fname,
privateInfo->timeline, segno, WalSegSz);
pg_fatal("could not find file \"%s\"
in \"%s\" archive",
fname,
privateInfo->archive_name);
}
astreamer_archive_read() will return 0 if we reach the end of the
tarfile, so this is saying that if we reach the end of the tar file
without finding the range of bytes for which we're looking, the
explanation must be that the relevant WAL file is missing from the
archive. But that is way too much action at a distance. I was able to
easily construct a counterexample by copying the first 81920 bytes of
a valid WAL file and then doing this:
[robert.haas pgsql-meson]$ tar tf pg_wal.tar
000000010000000000000005
[robert.haas pgsql-meson]$ pg_waldump -s 0/050008D8 -e 0/05FFED98
pg_wal.tar >/dev/null
pg_waldump: error: could not find file "000000010000000000000005" in
"pg_wal.tar" archive
Without the redirection to /dev/null, what happened was that
pg_waldump printed out a bunch of records from
000000010000000000000005 and then said that 000000010000000000000005
could not be found, which is obviously silly. But the fact that I
found a specific counterexample here isn't even really the point. The
point is that there's a big gap between what we actually know at this
point (which is that we've read the whole input file) and what the
message is claiming (which is that the reason must be that the file is
missing from the archive). Even if the counterexample above didn't
exist and that really were the only way for that to happen as of
today, that's very fragile. Maybe some future code change will make it
so that there's a second reason that could happen. How would somebody
realize that they had created a second condition by means of which
this code could be reached? If they did realize it, how would they get
the correct error to be reported?
I'm not quite sure how this should be fixed, but I strongly suspect
that the error report here needs to move closer to the code that is
doing the file reordering. Aside from the possibility of the file
being missing and the possibility of the file being too short, a third
possibility is that targetPagePtr retreats between one call and the
next. That really shouldn't happen, but there are no asserts here
verifying that it doesn't.
I also don't like the fact that one call to astreamer_archive_read()
checks the return value (but only whether it's zero, the specific
return value apparently doesn't matter, so why doesn't it return
bool?) and the other doesn't. That kind of coding pattern is very
rarely correct. The code says:
/* Continue reading from the open WAL segment, if any */
if (state->seg.ws_file >= 0)
{
/*
* To prevent a race condition where the archive streamer is still
* exporting a file that we are trying to read, we invoke the streamer
* to ensure enough data is available.
*/
if (private->curSegNo == state->seg.ws_segno)
astreamer_archive_read(private);
return WALDumpReadPage(state, targetPagePtr, reqLen, targetPtr,
readBuff);
}
But it's unclear why this should be good enough to ensure that enough
data is available. astreamer_archive_read() might read zero bytes and
return 0, so this doesn't really guarantee anything at all. On the
other hand, even if astreamer_archive_read() returns a non-zero
value, it's only going to read READ_CHUNK_SIZE bytes from the
underlying file, so if more than that needs to be read in order for us
to have enough data, we won't. I think it's very hard to imagine a
situation in which you can call astreamer_archive_read() without using
some loop. That's what astreamer_wal_read() does: it calls
astreamer_archive_read() until it either returns 0 -- in which case we
know we've failed -- or until we have enough data. Here we just hope
that calling it once is enough, and that checking for errors is
unimportant. I also don't understand the reference to a race
condition, because there's only one process with one thread here, I
believe, so what would be racing against?
Another thing I noticed is that astreamer_archive_read() makes
reference to decrypting, but there's no cryptography involved in any
of this.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Oct 10, 2025 at 11:32 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Sep 29, 2025 at 12:17 PM Amul Sul <sulamul@gmail.com> wrote:
> > While reusing the buffered data
> > from the first iteration is technically possible, that only works if
> > the desired start LSN is at the absolute beginning of the archive, or
> > later in the sequence, which cannot be reliably guaranteed.
>
> I spent a bunch of time studying this code today and I think that the
> problem you're talking about here is evidence of a design problem with
> astreamer_wal_read() and some of the other code in
> astreamer_waldump.c. Your code calls astreamer_wal_read() when it
> wants to peek at the first xlog block to determine the WAL segment
> size, and it also calls astreamer_wal_read() when it wants to read WAL
> sequentially beginning at the start LSN and continuing until it
> reaches the end LSN. However, these two cases have very different
> requirements. verify_tar_archive(), which is misleadingly named and
> really exists to determine the WAL segment size, just wants to read
> the first xlog block that physically appears in the archive. Every
> xlog block will have the same WAL segment size, so it does not matter
> which one we read. On the other hand, TarWALDumpReadPage wants to read
> WAL in sequential order. In other words, one call to
> astreamer_wal_read() really wants to read a block without any block
> reordering, and the other call wants to read a block with block
> reordering.
>
> To me, it looks like the problem here is that the block reordering
> functionality should live on top of the astreamer, not inside of it.
> Imagine that astreamer just spits out the bytes in the order in which
> they physically appear in the archive, and then there's another
> component that consumes and reorders those bytes. So, you read data
> and push it into the astreamer until the number of bytes in the output
> buffer is at least XLOG_BLCKSZ, and then from there you extract the
> WAL segment size. Then, you call XLogReaderAllocate() and enter the
> main loop. The reordering logic lives inside of TarWALDumpReadPage().
> Each time it gets data from the astreamer's buffer, it either returns
> it to the caller if it's in order or buffers it using temporary files
> if not.
>
I initially considered implementing the reordering logic outside of
astreamer when we first discussed this project, but the implementation
could get complicated -- or at least feel hacky. Let me explain why:
astreamer reads the archive in fixed-size chunks (here it is 128KB).
Sometimes, a single read can contain data from two WAL files --
specifically, the tail end of one file and the start of the next --
because of how they’re physically stored in the archive. astreamer
knows where one file ends and another begins through tags like
ASTREAMER_MEMBER_HEADER, ASTREAMER_MEMBER_CONTENTS, and
ASTREAMER_MEMBER_TRAILER. However, it can’t pause mid-chunk to hold
back data from the next file for the caller once the previous one
ends; it pushes the entire chunk it has read to the target buffer.
So, if we put the reordering logic outside the streamer, we’d
sometimes be receiving buffers containing mixed data from two WAL
files. The caller would then need to correctly identify WAL file
boundaries within those buffers. This would require passing extra
metadata -- like segment numbers for the WAL files in the buffer, plus
start and end offsets of those segments within the buffer. While not
impossible, it feels a bit hacky and I'm unsure if that’s the best
approach.
> I found it's actually quite easy to write a patch that avoids
> reopening the file. Here it is, on top of your v4:
>
> diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
> index 2c42df46d43..c4346a5e211 100644
> --- a/src/bin/pg_waldump/pg_waldump.c
> +++ b/src/bin/pg_waldump/pg_waldump.c
> @@ -368,17 +368,8 @@ init_tar_archive_reader(XLogDumpPrivate *private,
> const char *waldir,
> XLogRecPtr startptr, XLogRecPtr endptr,
> pg_compress_algorithm compression)
> {
> - int fd;
> astreamer *streamer;
>
> - /* Open tar archive and store its file descriptor */
> - fd = open_file_in_directory(waldir, private->archive_name);
> -
> - if (fd < 0)
> - pg_fatal("could not open file \"%s\"", private->archive_name);
> -
> - private->archive_fd = fd;
> -
> /*
> * Create an appropriate chain of archive streamers for reading the given
> * tar archive.
> @@ -1416,12 +1407,22 @@ main(int argc, char **argv)
> /* we have everything we need, start reading */
> if (is_tar)
> {
> + /* Open tar archive and store its file descriptor */
> + private.archive_fd =
> + open_file_in_directory(waldir, private.archive_name);
> + if (private.archive_fd < 0)
> + pg_fatal("could not open file \"%s\"", private.archive_name);
> +
> /* Verify that the archive contains valid WAL files */
> waldir = waldir ? pg_strdup(waldir) : pg_strdup(".");
> init_tar_archive_reader(&private, waldir, InvalidXLogRecPtr,
> InvalidXLogRecPtr, compression);
> verify_tar_archive(&private);
> - free_tar_archive_reader(&private);
> + astreamer_free(private.archive_streamer);
> +
> + if (lseek(private.archive_fd, 0, SEEK_SET) != 0)
> + pg_log_error("could not seek in file \"%s\": %m",
> + private.archive_name);
>
> /* Set up for reading tar file */
> init_tar_archive_reader(&private, waldir, private.startptr,
>
> Of course, this is not really what we want to do: it avoids reopening
> the file, but because we can't back up the archive streamer once it's
> been created, we have to lseek back to the beginning of the file. But
> notice how silly this looks: with this patch, we free the archive
> reader and immediately create a new archive reader that is exactly the
> same in every way except that we call astreamer_waldump_new(startptr,
> endptr, private) instead of astreamer_waldump_new(InvalidXLogRecPtr,
> InvalidXLogRecPtr, private). We could arrange to update the original
> archive streamer with new values of startSegNo and endSegNo after
> verify_tar_archive(), but that's still not quite good enough, because
> we might have already made some decisions on what to do with the data
> that we read that it's too late to reverse. But, what that means is
> that the astreamer_waldump machinery is not smart enough to read one
> block of data without making irreversible decisions from which we
> can't recover without recreating the entire object. I think we can,
> and should, try to do better.
>
Agreed.
> It's also worth noting that the unfortunate layering doesn't just
> require us to read the first block of the file: it also complicates
> the code in various places. The fact that astreamer_wal_read() needs a
> special case for XLogRecPtrIsInvalid(recptr) is a direct result of
> this problem, and the READ_ANY_WAL() macro and both the places that
> test it are also direct results of this problem. In other words, I'm
> arguing that astreamer_wal_read() is incorrectly defined, and that
> error creates ugliness in the code both above and below
> astreamer_wal_read().
>
> While I'm on the topic of astreamer_wal_read(), here are a few other
> problems I noticed:
>
> * The return value is not documented, and it seems to always be count,
> in which case it might as well return void. The caller already has the
> value they passed for count.
The caller will be xlogreader, and I believe we shouldn't change that.
For the same reason, WALDumpReadPage() also returns the same.
> * It seems like it would be more appropriate to assert that endPtr >=
> len and just set startPtr = endPtr - len. I don't see how len > endPtr
> can ever happen, and I bet bad things will happen if it does.
> * "pg_waldump never ask the same" -> "pg_waldump never asks for the same"
>
Ok.
> Also, this is absolutely not OK with me:
>
> /* Fetch more data */
> if (astreamer_archive_read(privateInfo) == 0)
> {
> char fname[MAXFNAMELEN];
> XLogSegNo segno;
>
> XLByteToSeg(targetPagePtr, segno, WalSegSz);
> XLogFileName(fname,
> privateInfo->timeline, segno, WalSegSz);
>
> pg_fatal("could not find file \"%s\"
> in \"%s\" archive",
> fname,
> privateInfo->archive_name);
> }
>
> astreamer_archive_read() will return 0 if we reach the end of the
> tarfile, so this is saying that if we reach the end of the tar file
> without finding the range of bytes for which we're looking, the
> explanation must be that the relevant WAL file is missing from the
> archive. But that is way too much action at a distance. I was able to
> easily construct a counterexample by copying the first 81920 bytes of
> a valid WAL file and then doing this:
>
> [robert.haas pgsql-meson]$ tar tf pg_wal.tar
> 000000010000000000000005
> [robert.haas pgsql-meson]$ pg_waldump -s 0/050008D8 -e 0/05FFED98
> pg_wal.tar >/dev/null
> pg_waldump: error: could not find file "000000010000000000000005" in
> "pg_wal.tar" archive
>
> Without the redirection to /dev/null, what happened was that
> pg_waldump printed out a bunch of records from
> 000000010000000000000005 and then said that 000000010000000000000005
> could not be found, which is obviously silly. But the fact that I
> found a specific counterexample here isn't even really the point. The
> point is that there's a big gap between what we actually know at this
> point (which is that we've read the whole input file) and what the
> message is claiming (which is that the reason must be that the file is
> missing from the archive). Even if the counterexample above didn't
> exist and that really were the only way for that to happen as of
> today, that's very fragile. Maybe some future code change will make it
> so that there's a second reason that could happen. How would somebody
> realize that they had created a second condition by means of which
> this code could be reached? If they did realize it, how would they get
> the correct error to be reported?
>
Agreed, I'll think about this.
>
> /* Continue reading from the open WAL segment, if any */
> if (state->seg.ws_file >= 0)
> {
> /*
> * To prevent a race condition where the archive streamer is still
> * exporting a file that we are trying to read, we invoke the streamer
> * to ensure enough data is available.
> */
> if (private->curSegNo == state->seg.ws_segno)
> astreamer_archive_read(private);
>
> return WALDumpReadPage(state, targetPagePtr, reqLen, targetPtr,
> readBuff);
> }
>
> But it's unclear why this should be good enough to ensure that enough
> data is available. astreamer_archive_read() might read zero bytes and
> return 0, so this doesn't really guarantee anything at all. On the
> other hand, even if astreamer_archive_read() returns a non-zero
> value, it's only going to read READ_CHUNK_SIZE bytes from the
> underlying file, so if more than that needs to be read in order for us
> to have enough data, we won't. I think it's very hard to imagine a
> situation in which you can call astreamer_archive_read() without using
> some loop.
The loop isn't needed because the caller always requests 8KB of data,
while READ_CHUNK_SIZE is 128KB. It’s assumed that the astreamer has
already created the file with some initial data. For example, if only
a few bytes have been written so far, when we reach
TarWALDumpReadPage(), it detects that we’re reading the same file
that the astreamer is still writing to and hasn’t finished. It then
requests that another 128KB of data be appended by calling
astreamer_archive_read(), even though we only need 8KB at a time. This
process repeats each time the next 8KB chunk is requested:
astreamer_archive_read() appends another 128KB, and this continues
until the file has been fully read and written.
> That's what astreamer_wal_read() does: it calls
> astreamer_archive_read() until it either returns 0 -- in which case we
> know we've failed -- or until we have enough data. Here we just hope
> that calling it once is enough, and that checking for errors is
> unimportant. I also don't understand the reference to a race
> condition, because there's only one process with one thread here, I
> believe, so what would be racing against?
>
In the case where the astreamer is exporting a file to disk but hasn’t
finished writing it, and we call TarWALDumpReadPage() to request
block(s) from that WAL file, we can read only up to the existing
blocks in the file. Since the file is incomplete, reading may fail
later. To handle this, astreamer_archive_read() is invoked to append
more data -- usually more than the requested amount, as explained
earlier. That is the race condition I am trying to handle.
Now, regarding the concern of astreamer_archive_read() returning zero
without reading or appending any data: this can happen only if the WAL
is shorter than expected -- i.e., an incomplete WAL file. In that
case, WALDumpReadPage() will raise the appropriate error, so I don't
think we have to check at that point.
> Another thing I noticed is that astreamer_archive_read() makes
> reference to decrypting, but there's no cryptography involved in any
> of this.
>
I think that was a typo -- I meant decompression.
Regards,
Amul
On Thu, Oct 16, 2025 at 7:49 AM Amul Sul <sulamul@gmail.com> wrote:
> astreamer reads the archive in fixed-size chunks (here it is 128KB).
> Sometimes, a single read can contain data from two WAL files --
> specifically, the tail end of one file and the start of the next --
> because of how they’re physically stored in the archive. astreamer
> knows where one file ends and another begins through tags like
> ASTREAMER_MEMBER_HEADER, ASTREAMER_MEMBER_CONTENTS, and
> ASTREAMER_MEMBER_TRAILER. However, it can’t pause mid-chunk to hold
> data from the next file once the previous one ends and for the caller;
> it pushes the entire chunk it has read to the target buffer.
Right, this makes sense.
> So, if we put the reordering logic outside the streamer, we’d
> sometimes be receiving buffers containing mixed data from two WAL
> files. The caller would then need to correctly identify WAL file
> boundaries within those buffers. This would require passing extra
> metadata -- like segment numbers for the WAL files in the buffer, plus
> start and end offsets of those segments within the buffer. While not
> impossible, it feels a bit hacky and I'm unsure if that’s the best
> approach.
I agree that we need that kind of metadata, but I don't see why our
need for it depends on where we do the reordering. That is, if we do
the reordering above the astreamer layer, we need to keep track of the
origin of each chunk of WAL bytes, and if we do the reordering within
the astreamer layer, we still need to keep track of the origin of the
WAL bytes. Doing the ordering properly requires that tracking, but it
doesn't say anything about where that tracking has to be performed.
I think it might be better if we didn't write to the astreamer's
buffer at all. For example, suppose we create a struct that looks
approximately like this:
struct ChunkOfDecodedWAL
{
XLogSegNo segno; // could also be XLogRecPtr start_lsn or char
*walfilename or whatever
StringInfoData buffer;
char *spillfilename; // or whatever we use to identify the temporary files
bool already_removed;
// potentially other metadata
};
Then, create a hash table and key it on the segno or whatever. Have the
astreamer write to the hash table: when it gets a chunk of WAL, it
looks up or creates the relevant hash table entry and appends the data
to the buffer. At any convenient point in the code, you can decide to
write the data from the buffer to a spill file, after which you
resetStringInfo() on the buffer and populate the spill file name. When
you've used up the data, you remove the spill file and set the
already_removed flag.
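To make the idea concrete, here is a toy sketch of that bookkeeping (all names hypothetical except ChunkOfDecodedWAL from the struct above; a fixed-size linear table stands in for whatever real hash table implementation would be used):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <stddef.h>

typedef uint64_t XLogSegNo;

#define MAX_SEGS 64				/* toy capacity, illustration only */

/* Per-segment bookkeeping, along the lines sketched above. */
typedef struct
{
	XLogSegNo	segno;
	char		buffer[8192];	/* in-memory data not yet spilled */
	size_t		buf_len;
	char		spillfilename[64];	/* set once data is spilled to disk */
	int			already_removed;
	int			in_use;
} ChunkOfDecodedWAL;

static ChunkOfDecodedWAL table[MAX_SEGS];

/*
 * Look up or create the entry for a segment.  A linear scan stands in for
 * a real hash lookup; returns NULL when the toy table is full.
 */
static ChunkOfDecodedWAL *
wal_chunk_get(XLogSegNo segno)
{
	for (int i = 0; i < MAX_SEGS; i++)
		if (table[i].in_use && table[i].segno == segno)
			return &table[i];
	for (int i = 0; i < MAX_SEGS; i++)
		if (!table[i].in_use)
		{
			memset(&table[i], 0, sizeof(table[i]));
			table[i].in_use = 1;
			table[i].segno = segno;
			return &table[i];
		}
	return NULL;
}

/* As WAL bytes emerge from the astreamer, append them to the right entry. */
static void
wal_chunk_append(XLogSegNo segno, const char *data, size_t len)
{
	ChunkOfDecodedWAL *c = wal_chunk_get(segno);

	assert(c != NULL && c->buf_len + len <= sizeof(c->buffer));
	memcpy(c->buffer + c->buf_len, data, len);
	c->buf_len += len;
}
```

The point stands regardless of the hash table used: because every segment ever seen keeps an entry, end-of-archive error reporting can consult the entries to say precisely what is missing or short.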
I think this could also help with the error reporting stuff. When you
get to the end of the file, you'll know all the files you saw and how
much data you read from each of them. So you could possibly do
something like
ERROR: LSN %08X/%08X not found in archive \"%s\"
DETAIL: WAL segment %s is not present in the archive
-or-
DETAIL: WAL segment %s was expected to be %u bytes, but was only %u bytes
-or-
DETAIL: whatever else can go wrong
The point is that every file you've ever seen has a hash table entry,
and in that hash table entry you can store everything about that file
that you need to know, whether that's the file data, the disk file
that contains the file data, the fact that we already threw the data
away, or any other fact that you can imagine wanting to know.
Said differently, the astreamer buffer is not really a great place to
write data. It exists because when we're just forwarding data from one
astreamer to the next, we will often need to buffer a small amount of
data to avoid terrible performance. However, it's only there to be
used when we don't have something better. I don't think any astreamer
that is intended to be the last one in the chain currently writes to
the buffer -- they write to the output file, or whatever, because
using an in-memory buffer as your final output destination is not a
real good plan.
> > While I'm on the topic of astreamer_wal_read(), here are a few other
> > problems I noticed:
> >
> > * The return value is not documented, and it seems to always be count,
> > in which case it might as well return void. The caller already has the
> > value they passed for count.
>
> The caller will be xlogreader, and I believe we shouldn't change that.
> For the same reason, WALDumpReadPage() also returns the same.
OK, but then you can make that clear via a brief comment.
> The loop isn't needed because the caller always requests 8KB of data,
> while READ_CHUNK_SIZE is 128KB. It’s assumed that the astreamer has
> already created the file with some initial data. For example, if only
> a few bytes have been written so far, when we reach
> TarWALDumpReadPage(), it detects that we’re reading the same file
> that the astreamer is still writing to and hasn’t finished. It then
> requests that another 128KB of data be appended by calling
> astreamer_archive_read(), even though we only need 8KB at a time. This
> process repeats each time the next 8KB chunk is requested:
> astreamer_archive_read() appends another 128KB, and this continues
> until the file has been fully read and written.
Sure, but you don't know how much data is going to come out the other
end of the astreamer pipeline. Since the data is (possibly)
compressed, you expect at least as many bytes to emerge from the
output end as you add to the input end, but it's not a good idea to
rely on assumptions like that. Sometimes compressors end up making the
data slightly larger instead of smaller. It's unlikely that the effect
would be so dramatic that adding 128kB to one end of the pipeline
would make less than 8kB emerge from the other end, but, again, it's
better not to rely on that. Not that this is a real
thing, but imagine that the compressed file had something in the
middle of it that behaved like a comment in C code, i.e. it didn't
generate any output.
> In the case where the astreamer is exporting a file to disk but hasn’t
> finished writing it, and we call TarWALDumpReadPage() to request
> block(s) from that WAL file, we can read only up to the existing
> blocks in the file. Since the file is incomplete, reading may fail
> later. To handle this, astreamer_archive_read() is invoked to append
> more data -- usually more than the requested amount, as explained
> earlier. That is the race condition I am trying to handle.
That's not what a race condition is:
https://en.wikipedia.org/wiki/Race_condition
> Now, regarding the concern of astreamer_archive_read() returning zero
> without reading or appending any data: this can happen only if the WAL
> is shorter than expected -- i.e., an incomplete segment. In that case,
> WALDumpReadPage() will raise the appropriate error, we don't have to
> check at that point, I think.
I'm not going to accept that kind of justification -- it is too
fragile to assume that you don't need to check for an error because it
"can't happen". Sometimes that is reasonable, but there is quite a lot
of action-at-a-distance here, so it does not feel safe.
> > Another thing I noticed is that astreamer_archive_read() makes
> > reference to decrypting, but there's no cryptography involved in any
> > of this.
>
> I think that was a typo -- I meant decompression.
I figured as much, but it still needs fixing.
--
Robert Haas
EDB: http://www.enterprisedb.com