Re: Warn when parallel restoring a custom dump without data offsets

From: Tom Lane
Subject: Re: Warn when parallel restoring a custom dump without data offsets
Date:
Msg-id: 2571310.1594583878@sss.pgh.pa.us
In response to: Re: Warn when parallel restoring a custom dump without data offsets  (David Gilman <dgilman@gilslotd.com>)
Responses: Re: Warn when parallel restoring a custom dump without data offsets  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
David Gilman <dgilman@gilslotd.com> writes:
> On Thu, Jul 02, 2020 at 05:25:21PM -0400, Tom Lane wrote:
>> I guess I'm completely confused about the purpose of these patches.
>> Far from coping with the situation of an unseekable file, they appear
>> to change pg_restore so that it fails altogether if it can't seek
>> its input file.  Why would we want to do this?

> I'm not sure where the "fails altogether if it can't seek" is.

I misread the patch, is where :-(

As penance, I spent some time studying this patchset, and have a few
comments:

1. The proposed doc change in 0001 seems out-of-date; isn't it adding a
warning about exactly the deficiency that the rest of the patch is
eliminating?  Note that the preceding para already says that the input
has to be seekable, so that's covered.  Maybe there is reason for
documenting that parallel restore will be slower if the archive was
written in a non-seekable way ... but that's not what this says.

2. It struck me that the patch is still pretty inefficient, in that
anytime it has to back up in an offset-less archive, it blindly rewinds
to dataStart and rescans everything.  In the worst case that'd still be
O(N^2) work, and it's really not necessary, because once we've seen a
given data block we know where it is.  We just have to remember that,
which seems easy enough.  (Well, on Windows it's a bit trickier because
the state in question is shared across threads; but that's good, it might
save some work.)

3. Expanding on #2, we actually don't need the rewind and retry logic
at all.  If we are looking for a block we haven't already seen, and we
get to the end of the archive, it ain't there.  (This is a bit less
obvious in the Windows case than otherwise, but I think it's still true,
given that the start state is either "all offsets known" or "no offsets
known".  A particular thread might skip over some blocks on the strength
of an offset established by another thread, but the blocks ahead of that
spot must now all have known offsets.)

4. Patch 0002 seems mighty expensive for the amount of code coverage
it's adding.  On my machine it seems to raise the overall runtime of
pg_dump's "make installcheck" by about 10%, and the only new coverage
is of the few lines added here.  I wonder if we couldn't cover that
more cheaply by testing what happens when we use a "-L" option with
an intentionally mis-sorted restore list.  (A rough sketch of that idea
follows the list below.)

5. I'm inclined to reject 0003.  It's not saving anything very meaningful,
and we'd just have to put the code back whenever somebody gets around
to making pg_backup_tar.c capable of out-of-order restores like
pg_backup_custom.c is now able to do.
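
To illustrate the cheaper test suggested in #4 — this is an untested
sketch, not part of the patchset, and the file names are placeholders:
pull the archive's TOC with "pg_restore -l", reverse the relative order
of the TABLE DATA entries, and feed the result back through "-L".

use strict;
use warnings;

my $dump = 'defaults_custom_format.dump';    # an existing -Fc archive
my $toc  = 'mis_sorted.toc';

# Grab the archive's TOC listing.
my @lines = `pg_restore -l $dump`;
die "pg_restore -l failed" if $?;

# Reverse the relative order of the TABLE DATA entries in place; that
# should be enough to force out-of-order data-block requests.
my @data_idx = grep { $lines[$_] =~ /TABLE DATA/ } 0 .. $#lines;
@lines[@data_idx] = reverse @lines[@data_idx];

open my $fh, '>', $toc or die "cannot write $toc: $!";
print $fh @lines;
close $fh;

# Restore using the mis-sorted list; writing plain SQL to a file keeps
# the test independent of any live database.
system('pg_restore', '-L', $toc, '-f', 'out.sql', $dump) == 0
    or die "pg_restore -L failed";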

The attached 0001 rewrites your 0001 as per the above ideas (dropping
the proposed doc change for now), and includes your 0004 for simplicity.
I'm including your 0002 verbatim just so the cfbot will be able to do a
meaningful test on 0001; but as stated, I don't really want to commit it.

            regards, tom lane

diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index 6ab122242c..2659676fd9 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -421,13 +421,47 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
     {
         /*
          * We cannot seek directly to the desired block.  Instead, skip over
-         * block headers until we find the one we want.  This could fail if we
-         * are asked to restore items out-of-order.
+         * block headers until we find the one we want.  Remember the
+         * positions of skipped-over blocks, so that if we later decide we
+         * need to read one, we'll be able to seek to it.
          */
-        _readBlockHeader(AH, &blkType, &id);
-
-        while (blkType != EOF && id != te->dumpId)
+        for (;;)
         {
+            pgoff_t        thisBlkPos = _getFilePos(AH, ctx);
+            TocEntry   *otherte;
+
+            _readBlockHeader(AH, &blkType, &id);
+
+            if (blkType == EOF || id == te->dumpId)
+                break;
+
+            otherte = getTocEntryByDumpId(AH, id);
+            if (otherte && otherte->formatData)
+            {
+                lclTocEntry *othertctx = (lclTocEntry *) otherte->formatData;
+
+                /*
+                 * Note: on Windows, multiple threads might access/update the
+                 * same lclTocEntry concurrently, but that should be safe as
+                 * long as we update dataPos before dataState.  Ideally, we'd
+                 * use pg_write_barrier() to enforce that, but the needed
+                 * infrastructure doesn't exist in frontend code.  But Windows
+                 * only runs on machines with strong store ordering, so it
+                 * should be okay for now.
+                 */
+                if (othertctx->dataState == K_OFFSET_POS_NOT_SET)
+                {
+                    othertctx->dataPos = thisBlkPos;
+                    othertctx->dataState = K_OFFSET_POS_SET;
+                }
+                else if (othertctx->dataPos != thisBlkPos ||
+                         othertctx->dataState != K_OFFSET_POS_SET)
+                {
+                    /* sanity check */
+                    pg_log_warning("data block %d has wrong seek position", id);
+                }
+            }
+
             switch (blkType)
             {
                 case BLK_DATA:
@@ -443,7 +477,6 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
                           blkType);
                     break;
             }
-            _readBlockHeader(AH, &blkType, &id);
         }
     }
     else
@@ -455,20 +488,18 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
         _readBlockHeader(AH, &blkType, &id);
     }

-    /* Produce suitable failure message if we fell off end of file */
+    /*
+     * If we reached EOF without finding the block we want, then either it
+     * doesn't exist, or it does but we lack the ability to seek back to it.
+     */
     if (blkType == EOF)
     {
-        if (tctx->dataState == K_OFFSET_POS_NOT_SET)
-            fatal("could not find block ID %d in archive -- "
-                  "possibly due to out-of-order restore request, "
-                  "which cannot be handled due to lack of data offsets in archive",
-                  te->dumpId);
-        else if (!ctx->hasSeek)
+        if (!ctx->hasSeek)
             fatal("could not find block ID %d in archive -- "
                   "possibly due to out-of-order restore request, "
                   "which cannot be handled due to non-seekable input file",
                   te->dumpId);
-        else                    /* huh, the dataPos led us to EOF? */
+        else
             fatal("could not find block ID %d in archive -- "
                   "possibly corrupt archive",
                   te->dumpId);
@@ -560,19 +591,27 @@ _skipData(ArchiveHandle *AH)
     blkLen = ReadInt(AH);
     while (blkLen != 0)
     {
-        if (blkLen > buflen)
+        if (ctx->hasSeek)
         {
-            if (buf)
-                free(buf);
-            buf = (char *) pg_malloc(blkLen);
-            buflen = blkLen;
+            if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
+                fatal("error during file seek: %m");
         }
-        if ((cnt = fread(buf, 1, blkLen, AH->FH)) != blkLen)
+        else
         {
-            if (feof(AH->FH))
-                fatal("could not read from input file: end of file");
-            else
-                fatal("could not read from input file: %m");
+            if (blkLen > buflen)
+            {
+                if (buf)
+                    free(buf);
+                buf = (char *) pg_malloc(blkLen);
+                buflen = blkLen;
+            }
+            if ((cnt = fread(buf, 1, blkLen, AH->FH)) != blkLen)
+            {
+                if (feof(AH->FH))
+                    fatal("could not read from input file: end of file");
+                else
+                    fatal("could not read from input file: %m");
+            }
         }

         ctx->filePos += blkLen;
@@ -830,10 +869,13 @@ _Clone(ArchiveHandle *AH)
         fatal("compressor active");

     /*
+     * We intentionally do not clone TOC-entry-local state: it's useful to
+     * share knowledge about where the data blocks are across threads.
+     * _PrintTocData has to be careful about the order of operations on that
+     * state, though.
+     *
      * Note: we do not make a local lo_buf because we expect at most one BLOBS
-     * entry per archive, so no parallelism is possible.  Likewise,
-     * TOC-entry-local state isn't an issue because any one TOC entry is
-     * touched by just one worker child.
+     * entry per archive, so no parallelism is possible.
      */
 }

From 6e8294021482fb76cb17d6c150ec6acb6d0cac37 Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Sat, 23 May 2020 10:49:33 -0400
Subject: [PATCH 2/5] Add integration test for out-of-order TOC requests in
 pg_restore

---
 src/bin/pg_dump/t/002_pg_dump.pl | 107 ++++++++++++++++++++++++++++++-
 src/test/perl/TestLib.pm         |  12 +++-
 2 files changed, 114 insertions(+), 5 deletions(-)

diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index e116235769..d99e7a5d22 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -31,9 +31,20 @@ my $tempdir_short = TestLib::tempdir_short;
 # generate a text file to run through the tests from the
 # non-text file generated by pg_dump.
 #
-# TODO: Have pg_restore actually restore to an independent
-# database and then pg_dump *that* database (or something along
-# those lines) to validate that part of the process.
+# secondary_dump_cmd is an optional key that enables a
+#
+#     pg_dump (dump_cmd) ->
+#     pg_restore (restore_cmd, real database) ->
+#     pg_dump (secondary_dump_cmd, SQL output (for matching))
+#
+# test process instead of the default
+#
+#     pg_dump (dump_cmd) ->
+#     pg_restore (restore_cmd, SQL output (for matching))
+#
+# test process. The secondary_dump database is created for
+# you and your restore_cmd and secondary_dump_cmd must read
+# and write from it.

 my %pgdump_runs = (
     binary_upgrade => {
@@ -139,6 +150,25 @@ my %pgdump_runs = (
             "$tempdir/defaults_custom_format.dump",
         ],
     },
+    defaults_custom_format_no_seek_parallel_restore => {
+        test_key => 'defaults',
+        dump_cmd => [
+            ['pg_dump', '-Fc', '-Z6', '--no-sync', 'postgres'],
+            '|',
+            ['perl', '-pe', ''],  # make pg_dump's stdout unseekable
+            ">$tempdir/defaults_custom_format_no_seek_parallel_restore"
+        ],
+        restore_cmd => [
+            'pg_restore', '-Fc', '--jobs=2',
+            '--dbname=secondary_dump',
+            "$tempdir/defaults_custom_format_no_seek_parallel_restore",
+        ],
+        secondary_dump_cmd => [
+            'pg_dump', '-Fp', '--no-sync',
+            "--file=$tempdir/defaults_custom_format_no_seek_parallel_restore.sql",
+            'secondary_dump',
+        ],
+    },

     # Do not use --no-sync to give test coverage for data sync.
     defaults_dir_format => {
@@ -153,6 +183,24 @@ my %pgdump_runs = (
             "$tempdir/defaults_dir_format",
         ],
     },
+    defaults_dir_format_parallel => {
+        test_key => 'defaults',
+        dump_cmd => [
+            'pg_dump',                                      '-Fd',
+            '--jobs=2',                                     '--no-sync',
+            "--file=$tempdir/defaults_dir_format_parallel", 'postgres',
+        ],
+        restore_cmd => [
+            'pg_restore', '-Fd',
+            '--jobs=2', '--dbname=secondary_dump',
+            "$tempdir/defaults_dir_format_parallel",
+        ],
+        secondary_dump_cmd => [
+            'pg_dump', '-Fp', '--no-sync',
+            "--file=$tempdir/defaults_dir_format_parallel.sql",
+            'secondary_dump',
+        ],
+    },

     # Do not use --no-sync to give test coverage for data sync.
     defaults_parallel => {
@@ -3298,6 +3346,35 @@ my %tests = (
             %full_runs, %dump_test_schema_runs, section_pre_data => 1,
         },
         unlike => { exclude_dump_test_schema => 1 },
+    },
+
+    # This reliably produces the "possibly due to out-of-order restore request"
+    # error when restoring with -j 2, at least as of this writing, but future
+    # changes to how pg_dump works could accidentally order things correctly,
+    # as we're not really telling pg_dump how to do its on-disk layout.
+    'custom dump out-of-order' => {
+        create_sql   => '
+            CREATE TABLE dump_test.ooo_parent0 (id int primary key);
+            CREATE TABLE dump_test.ooo_child0 (
+                parent0 int references dump_test.ooo_parent0 (id)
+            );
+            CREATE TABLE dump_test.ooo_parent1 (id int primary key);
+            CREATE TABLE dump_test.ooo_child1 (
+                parent0 int references dump_test.ooo_parent1 (id)
+            );',
+        regexp => qr/^
+             \QCREATE TABLE dump_test.ooo_child0 (\E\n
+             \s+\Qparent0 integer\E\n
+             \Q);\E\n/xm,
+        like => {
+            %full_runs, %dump_test_schema_runs,
+            section_pre_data => 1,
+            defaults_custom_format_no_seek => 1,
+            defaults_custom_format_no_seek_parallel_restore => 1,
+        },
+        unlike => {
+            exclude_dump_test_schema => 1,
+        },
     });

 #########################################
@@ -3350,6 +3427,11 @@ foreach my $run (sort keys %pgdump_runs)
         $num_tests++;
     }

+    if ($pgdump_runs{$run}->{secondary_dump_cmd})
+    {
+        $num_tests++;
+    }
+
     if ($pgdump_runs{$run}->{test_key})
     {
         $test_key = $pgdump_runs{$run}->{test_key};
@@ -3499,12 +3581,23 @@ foreach my $run (sort keys %pgdump_runs)
     $node->command_ok(\@{ $pgdump_runs{$run}->{dump_cmd} },
         "$run: pg_dump runs");

+    if ($pgdump_runs{$run}->{secondary_dump_cmd})
+    {
+        $node->safe_psql('postgres', 'create database secondary_dump;');
+    }
+
     if ($pgdump_runs{$run}->{restore_cmd})
     {
         $node->command_ok(\@{ $pgdump_runs{$run}->{restore_cmd} },
             "$run: pg_restore runs");
     }

+    if ($pgdump_runs{$run}->{secondary_dump_cmd})
+    {
+        $node->command_ok(\@{ $pgdump_runs{$run}->{secondary_dump_cmd} },
+            "$run: secondary pg_dump runs");
+    }
+
     if ($pgdump_runs{$run}->{test_key})
     {
         $test_key = $pgdump_runs{$run}->{test_key};
@@ -3561,6 +3654,14 @@ foreach my $run (sort keys %pgdump_runs)
             }
         }
     }
+
+    if ($pgdump_runs{$run}->{secondary_dump_cmd})
+    {
+        $node->safe_psql('secondary_dump',
+            'alter subscription sub1 set (slot_name = NONE);');
+        $node->safe_psql('secondary_dump', 'drop subscription sub1;');
+        $node->safe_psql('postgres',       'drop database secondary_dump;');
+    }
 }

 #########################################
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index d579d5c177..fb90d1abcc 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -338,8 +338,16 @@ The return value from the command is passed through.

 sub run_log
 {
-    print("# Running: " . join(" ", @{ $_[0] }) . "\n");
-    return IPC::Run::run(@_);
+    my $flatten = sub {
+        return map { ref eq 'ARRAY' ? @$_ : $_ } @_;
+    };
+    my @x = @{$_[0]};
+    print("# Running: " . join(" ", $flatten->(@x)) . "\n");
+    if (ref($x[0]) ne '') {
+        return IPC::Run::run(@x);
+    } else {
+        return IPC::Run::run(@_);
+    }
 }

 =pod
--
2.27.0

