Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Поиск
Список
Период
Сортировка
От Paul Guo
Тема Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Дата
Msg-id CAEET0ZGhmDKrq7JJu2rLLqcJBR8pA4OYrKsirZ5Ft8-deG1e8A@mail.gmail.com
обсуждение исходный текст
Ответ на Re: standby recovery fails (tablespace related) (tentative patch and discussion)  (Paul Guo <pguo@pivotal.io>)
Ответы Re: standby recovery fails (tablespace related) (tentative patchand discussion)  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Список pgsql-hackers
I updated the original patch to

1) skip copydir() if either src path or dst parent path is missing in dbase_redo(). Both missing cases seem to be possible. For the src path missing case, mkdir_p() is meaningless. It seems that moving the directory existence check step to dbase_redo() has less impact on other code.

2) Fixed dbase_desc(). Now the xlog output looks correct.

rmgr: Database    len (rec/tot):     42/    42, tx:        486, lsn: 0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to pg_tblspc/16384/PG_12_201904281/16386

rmgr: Database    len (rec/tot):     34/    34, tx:        487, lsn: 0/01638EB8, prev 0/01638E40, desc: DROP dir pg_tblspc/16384/PG_12_201904281/16386

I'm not familiar with the TAP test details previously. I learned a lot about how to test such case from Kyotaro's patch series.👍

On Sun, Apr 28, 2019 at 3:33 PM Paul Guo <pguo@pivotal.io> wrote:

On Wed, Apr 24, 2019 at 4:14 PM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Mmm. I posted to wrong thread. Sorry.

At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190423.163949.36763221.horiguchi.kyotaro@lab.ntt.co.jp>
> At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZEcwz57z2yfWRds43b3TfQPPDSWmbjGmD43xRxLT41NDg@mail.gmail.com>
> > Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
> > create redo error, but I suspect some other kind of redo, which depends on
> > the files under the directory (they are not copied since the directory is
> > not created) and also cannot be covered by the invalid page mechanism,
> > could fail. Thanks.
>
> If recovery starts from just after tablespace creation, that's
> simple. The Symlink to the removed tablespace is already removed
> in the case. Hence server innocently create files directly under
> pg_tblspc, not in the tablespace. Finally all files that were
> supposed to be created in the removed tablespace are removed
> later in recovery.
>
> If recovery starts from recalling page in a file that have been
> in the tablespace, XLogReadBufferExtended creates one (perhaps
> directly in pg_tblspc as described above) and the files are
> removed later in recoery the same way to above. This case doen't
> cause FATAL/PANIC during recovery even in master.
>
> XLogReadBufferExtended@xlogutils.c
> | * Create the target file if it doesn't already exist.  This lets us cope
> | * if the replay sequence contains writes to a relation that is later
> | * deleted.  (The original coding of this routine would instead suppress
> | * the writes, but that seems like it risks losing valuable data if the
> | * filesystem loses an inode during a crash.  Better to write the data
> | * until we are actually told to delete the file.)
>
> So buffered access cannot be a problem for the reason above. The
> remaining possible issue is non-buffered access to files in
> removed tablespaces. This is what I mentioned upthread:
>
> me> but I haven't checked this covers all places where need the same
> me> treatment.

RM_DBASE_ID is fixed by the patch.

XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
  - are not relevant.

HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
  - Resources works on buffer is not affected.

SMGR:
  - Both CREATE and TRUNCATE seems fine.

TBLSPC:
  - We don't nest tablespace directories. No Problem.

I don't find a similar case.

I took some time in digging into the related code. It seems that ignoring if the dst directory cannot be created directly
should be fine since smgr redo code creates tablespace path finally by calling TablespaceCreateDbspace().
What's more, I found some more issues.

1) The below error message is actually misleading.

2019-04-17 14:52:14.951 CST [23030] FATAL:  could not create directory "pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
2019-04-17 14:52:14.951 CST [23030] CONTEXT:  WAL redo at 0/3011650 for Database/CREATE: copy dir 1663/1 to 65546/65547

That should be due to dbase_desc(). It could be simply fixed following the code logic in GetDatabasePath().

2) It seems that src directory could be missing then dbase_redo()->copydir() could error out. For example,

\!rm -rf /tmp/tbspace1  
\!mkdir /tmp/tbspace1
\!rm -rf /tmp/tbspace2
\!mkdir /tmp/tbspace2
create tablespace tbs1 location '/tmp/tbspace1'; 
create tablespace tbs2 location '/tmp/tbspace2'; 
create database db1 tablespace tbs1;
alter database db1 set tablespace tbs2;
drop tablespace tbs1;

Let's say, the standby finishes all replay but redo lsn on pg_control is still the point at 'alter database', and then
kill postgres, then in theory when startup, dbase_redo()->copydir() will ERROR since 'drop tablespace tbs1'
has removed the directories (and symlink) of tbs1. Below simple code change could fix that.

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9707afabd9..7d755c759e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2114,6 +2114,15 @@ dbase_redo(XLogReaderState *record)
         */
        FlushDatabaseBuffers(xlrec->src_db_id);

+       /*
+        * It is possible that the source directory is missing if
+        * we are re-replaying the xlog while subsequent xlogs
+        * drop the tablespace in previous replaying. For this
+        * we just skip.
+        */
+       if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+           return;
+
        /*
         * Copy this subdirectory to the new location
         *

If we want to fix the issue by ignoring the dst path create failure, I do not think we should do
that in copydir() since copydir() seems to be a common function. I'm not sure whether it is
used by some extensions or not. If no maybe we should move the dst patch create logic
out of copydir().

Also I'd suggest we should use pg_mkdir_p() in TablespaceCreateDbspace() to replace
the code block includes a lot of get_parent_directory(), MakePGDirectory(), etc even it
is not fixing a bug since pg_mkdir_p() code change seems to be more graceful and simpler.
 
Whatever ignore mkdir failure or mkdir_p, I found that these steps seem to be error-prone
along with postgre evolving since they are hard to test and also we are not easy to think out
various potential bad cases. Is it possible that we should do real checkpoint (flush & update
redo lsn) when seeing checkpoint xlogs for these operations? This will slow down standby
but master also does this and also these operations are not usual, espeically it seems that it
does not slow down wal receiving usually?



Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: John Naylor
Дата:
Сообщение: Re: Unhappy about API changes in the no-fsm-for-small-rels patch
Следующее
От: Andres Freund
Дата:
Сообщение: Re: REINDEX INDEX results in a crash for an index of pg_class since9.6