Re: pg_waldump: support decoding of WAL inside tarfile

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: pg_waldump: support decoding of WAL inside tarfile
Дата
Msg-id 374225.1774459521@sss.pgh.pa.us
обсуждение исходный текст
Ответ на Re: pg_waldump: support decoding of WAL inside tarfile  (Michael Paquier <michael@paquier.xyz>)
Список pgsql-hackers
Michael Paquier <michael@paquier.xyz> writes:
> The buildfarm has switched mostly to green, except on this one:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoatzin&dt=2026-03-23%2006%3A00%3A42

Yeah, there are some other weird failures on other machines too.
There's also the problem we already knew about of FD leakage breaking
cleanup of the temp file directory on Windows.

I wrote a patch to fix the FD leakage problem (v8-0001 attached).
I don't have Windows at hand, but I tested it by dint of having
the atexit callback invoke "lsof" to see if there were any open
files in the temp directory.

I also occasionally saw some of the weird errors mentioned above.
After much debugging, I believe that the issue is that
archive_waldump.c is unaware that inserting or deleting entries
in a simplehash.h hash table can cause other entries to move.
That can break privateInfo->cur_file, and it can also break
read_archive_wal_page which thought it could just re-use its
entry pointer after calling read_archive_file.  v8-0002 attached
fixes that, and I'm not seeing weird failures anymore.

Two additional thoughts:

1. The amount of data that pg_waldump's TAP tests use is not
sufficient to trigger these problems with any degree of reliability.
I'm hesitant to make the tests run longer, but clearly we do not
have adequate coverage now.

2. I didn't do it here, but I urgently think we should rip out
read_archive_wal_page's stanza that truncates the entry's
"buf" string (the "if (privateInfo->decoding_started)" part).
My faith in this code in general is at rock bottom, and my faith in
the extent to which we've tested it is somewhere below ground level.
I don't think we need rickety optimizations that serve only to
keep the active hashtable entry to less than 16MB, when we're going
to reclaim that space altogether as soon as we've finished dumping
that segment.  This truncation scares me because it adds a whole
'nother level of poorly-documented complexity to the invariants
around what is in entry->buf.  Also, while we theoretically should
not need to spill the entry after this point, if we did we would
write a corrupted spill file.

            regards, tom lane

From d21895df67eea1d624e1ffc6eaa7a2c86ef386ff Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed, 25 Mar 2026 10:46:44 -0400
Subject: [PATCH v8 1/2] Fix file descriptor leakages in pg_waldump.

TarWALDumpCloseSegment was of the opinion that it didn't need to
do anything.  It was mistaken: it has to close the open file if
any, because nothing else will.

In addition, we failed to ensure that any file being read by the
XLogReader machinery gets closed before the atexit callback tries to
cleanup the temporary directory holding spilled WAL files.  While the
file would have been closed already in case of a success exit, this
doesn't happen in case of pg_fatal() exits.  The least messy way
to fix that is to move the atexit function into pg_waldump.c,
where it has easier access to the XLogReaderState pointer and to
WALDumpCloseSegment.

These FD leakages are pretty insignificant on Unix-ish platforms,
but they're a bug on Windows, because they prevent successful cleanup
of the temporary directory for extracted WAL files.  (Windows can't
delete a directory that holds a deleted-but-still-open file.)
This is visible in occasional buildfarm failures.
---
 src/bin/pg_waldump/archive_waldump.c | 26 ++--------------
 src/bin/pg_waldump/pg_waldump.c      | 44 ++++++++++++++++++++++++++--
 2 files changed, 44 insertions(+), 26 deletions(-)

diff --git a/src/bin/pg_waldump/archive_waldump.c b/src/bin/pg_waldump/archive_waldump.c
index 96f4d94449f..b8e9754b729 100644
--- a/src/bin/pg_waldump/archive_waldump.c
+++ b/src/bin/pg_waldump/archive_waldump.c
@@ -91,7 +91,6 @@ static ArchivedWALFile *get_archive_wal_entry(const char *fname,
                                               XLogDumpPrivate *privateInfo);
 static bool read_archive_file(XLogDumpPrivate *privateInfo);
 static void setup_tmpwal_dir(const char *waldir);
-static void cleanup_tmpwal_dir_atexit(void);

 static FILE *prepare_tmp_write(const char *fname, XLogDumpPrivate *privateInfo);
 static void perform_tmp_write(const char *fname, StringInfo buf, FILE *file);
@@ -607,23 +606,10 @@ setup_tmpwal_dir(const char *waldir)
     pg_log_debug("created directory \"%s\"", TmpWalSegDir);
 }

-/*
- * Remove temporary directory at exit, if any.
- */
-static void
-cleanup_tmpwal_dir_atexit(void)
-{
-    Assert(TmpWalSegDir != NULL);
-
-    rmtree(TmpWalSegDir, true);
-
-    TmpWalSegDir = NULL;
-}
-
 /*
  * Open a file in the temporary spill directory for writing an out-of-order
- * WAL segment, creating the directory and registering the cleanup callback
- * if not already done.  Returns the open file handle.
+ * WAL segment, creating the directory if not already done.
+ * Returns the open file handle.
  */
 static FILE *
 prepare_tmp_write(const char *fname, XLogDumpPrivate *privateInfo)
@@ -631,15 +617,9 @@ prepare_tmp_write(const char *fname, XLogDumpPrivate *privateInfo)
     char        fpath[MAXPGPATH];
     FILE       *file;

-    /*
-     * Setup temporary directory to store WAL segments and set up an exit
-     * callback to remove it upon completion if not already.
-     */
+    /* Setup temporary directory to store WAL segments, if we didn't already */
     if (unlikely(TmpWalSegDir == NULL))
-    {
         setup_tmpwal_dir(privateInfo->archive_dir);
-        atexit(cleanup_tmpwal_dir_atexit);
-    }

     snprintf(fpath, MAXPGPATH, "%s/%s", TmpWalSegDir, fname);

diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 630d9859882..6bda3c12aa3 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -42,6 +42,8 @@ static const char *progname;

 static volatile sig_atomic_t time_to_stop = false;

+static XLogReaderState *xlogreader_state_cleanup = NULL;
+
 static const RelFileLocator emptyRelFileLocator = {0, 0, 0};

 typedef struct XLogDumpConfig
@@ -454,13 +456,14 @@ TarWALDumpOpenSegment(XLogReaderState *state, XLogSegNo nextSegNo,

 /*
  * pg_waldump's XLogReaderRoutine->segment_close callback to support dumping
- * WAL files from tar archives.  Segment tracking is handled by
- * TarWALDumpReadPage, so no action is needed here.
+ * WAL files from tar archives.  Same as wal_segment_close.
  */
 static void
 TarWALDumpCloseSegment(XLogReaderState *state)
 {
-    /* No action needed */
+    close(state->seg.ws_file);
+    /* need to check errno? */
+    state->seg.ws_file = -1;
 }

 /*
@@ -865,6 +868,27 @@ XLogDumpDisplayStats(XLogDumpConfig *config, XLogStats *stats)
            total_len, "[100%]");
 }

+/*
+ * Remove temporary directory at exit, if any.
+ */
+static void
+cleanup_tmpwal_dir_atexit(void)
+{
+    /*
+     * Before calling rmtree, we must close any open file we have in the temp
+     * directory; else rmdir fails on Windows.
+     */
+    if (xlogreader_state_cleanup != NULL &&
+        xlogreader_state_cleanup->seg.ws_file >= 0)
+        WALDumpCloseSegment(xlogreader_state_cleanup);
+
+    if (TmpWalSegDir != NULL)
+    {
+        rmtree(TmpWalSegDir, true);
+        TmpWalSegDir = NULL;
+    }
+}
+
 static void
 usage(void)
 {
@@ -1383,6 +1407,14 @@ main(int argc, char **argv)
     if (!xlogreader_state)
         pg_fatal("out of memory while allocating a WAL reading processor");

+    /*
+     * Set up atexit cleanup of temporary directory.  This must happen before
+     * archive_waldump.c could possibly create the temporary directory.  Also
+     * arm the callback to cleanup the xlogreader state.
+     */
+    atexit(cleanup_tmpwal_dir_atexit);
+    xlogreader_state_cleanup = xlogreader_state;
+
     /* first find a valid recptr to start from */
     first_record = XLogFindNextRecord(xlogreader_state, private.startptr, &errormsg);

@@ -1495,6 +1527,12 @@ main(int argc, char **argv)
                  LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
                  errormsg);

+    /*
+     * Disarm atexit cleanup of open WAL file; XLogReaderFree will close it,
+     * and we don't want the atexit callback trying to touch freed memory.
+     */
+    xlogreader_state_cleanup = NULL;
+
     XLogReaderFree(xlogreader_state);

     if (private.archive_name)
--
2.43.7

From f28cd200bd5db855164aa06be5c3e960a5ab2f0f Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed, 25 Mar 2026 13:00:54 -0400
Subject: [PATCH v8 2/2] Fix misuse of simplehash.h hash operations in
 pg_waldump.

Both ArchivedWAL_insert and ArchivedWAL_delete_item can cause
existing hashtable entries to move.  The code didn't account
for this and could leave privateInfo->cur_file pointing at a
dead or incorrect entry, with hilarity ensuing.  Likewise,
read_archive_wal_page calls read_archive_file which could
result in movement of the hashtable entry it is working with
(due to hashtable expansion for insertion of new entries).

I believe these bugs explain some odd buildfarm failures,
although the amount of data we use in pg_waldump's TAP tests
isn't enough to trigger them reliably.
---
 src/bin/pg_waldump/archive_waldump.c | 36 ++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/src/bin/pg_waldump/archive_waldump.c b/src/bin/pg_waldump/archive_waldump.c
index b8e9754b729..4348e192f19 100644
--- a/src/bin/pg_waldump/archive_waldump.c
+++ b/src/bin/pg_waldump/archive_waldump.c
@@ -307,6 +307,7 @@ read_archive_wal_page(XLogDumpPrivate *privateInfo, XLogRecPtr targetPagePtr,
     XLByteToSeg(targetPagePtr, segno, segsize);
     XLogFileName(fname, privateInfo->timeline, segno, segsize);
     entry = get_archive_wal_entry(fname, privateInfo);
+    Assert(!entry->spilled);

     while (nbytes > 0)
     {
@@ -403,6 +404,14 @@ read_archive_wal_page(XLogDumpPrivate *privateInfo, XLogRecPtr targetPagePtr,
                          privateInfo->archive_name, fname,
                          (long long int) (count - nbytes),
                          (long long int) count);
+
+            /*
+             * Loading more data may have moved hash table entries, so we must
+             * re-look-up the one we are reading from.
+             */
+            entry = ArchivedWAL_lookup(privateInfo->archive_wal_htab, fname);
+            /* ... it had better still be there */
+            Assert(entry != NULL);
         }
     }

@@ -429,6 +438,7 @@ void
 free_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo)
 {
     ArchivedWALFile *entry;
+    const char *oldfname;

     entry = ArchivedWAL_lookup(privateInfo->archive_wal_htab, fname);

@@ -454,7 +464,23 @@ free_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo)
     if (privateInfo->cur_file == entry)
         privateInfo->cur_file = NULL;

+    /*
+     * ArchivedWAL_delete_item may cause other hash table entries to move.
+     * Therefore, if cur_file isn't NULL now, we have to be prepared to look
+     * that entry up again after the deletion.  Fortunately, the entry's fname
+     * string won't move.
+     */
+    oldfname = privateInfo->cur_file ? privateInfo->cur_file->fname : NULL;
+
     ArchivedWAL_delete_item(privateInfo->archive_wal_htab, entry);
+
+    if (oldfname)
+    {
+        privateInfo->cur_file = ArchivedWAL_lookup(privateInfo->archive_wal_htab,
+                                                   oldfname);
+        /* ... it had better still be there */
+        Assert(privateInfo->cur_file != NULL);
+    }
 }

 /*
@@ -700,6 +726,9 @@ astreamer_waldump_content(astreamer *streamer, astreamer_member *member,
                 ArchivedWALFile *entry;
                 bool        found;

+                /* Shouldn't see MEMBER_HEADER in the middle of a file */
+                Assert(privateInfo->cur_file == NULL);
+
                 pg_log_debug("reading \"%s\"", member->pathname);

                 if (!member_is_wal_file(mystreamer, member, &fname))
@@ -728,6 +757,13 @@ astreamer_waldump_content(astreamer *streamer, astreamer_member *member,
                     }
                 }

+                /*
+                 * Note: ArchivedWAL_insert may cause existing hash table
+                 * entries to move.  While cur_file is known to be NULL right
+                 * now, read_archive_wal_page may have a live hash entry
+                 * pointer, which it needs to take care to update after
+                 * read_archive_file completes.
+                 */
                 entry = ArchivedWAL_insert(privateInfo->archive_wal_htab,
                                            fname, &found);

--
2.43.7


В списке pgsql-hackers по дате отправления: