[BUG] Archive recovery failure on 9.3+.

From: Kyotaro HORIGUCHI
Subject: [BUG] Archive recovery failure on 9.3+.
Msg-id: 20131212.110002.204892575.horiguchi.kyotaro@lab.ntt.co.jp
List: pgsql-hackers

Hello, we happened to see a server crash during archive recovery
under a certain condition.

After the TLI has been incremented, it can happen that the WAL
file for the older timeline has been archived while the file with
the same segment id for the newer timeline has not. In that case
archive recovery fails with a PANIC error like the following,

| PANIC: record with zero length at 0/1820D40

A reproduction script is attached. This issue occurred on 9.4dev
and 9.3.2, but not on 9.2.6 and 9.1.11. The latter search pg_xlog
for the TLI before trying the archive for older TLIs.
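
For readers less familiar with the file naming: the "same segment id on a different timeline" situation can be sketched as below. The helper name wal_segment_name is hypothetical (it is not part of recvtest.sh), and the sketch assumes the default 16 MB segment size, i.e. 256 segments per 4 GB of WAL.

```shell
#!/bin/bash
# Hypothetical helper, not part of the attached script: build a WAL
# segment file name from a timeline ID and a segment number.
# Assumes the default 16 MB segments (256 segments per 4 GB of WAL).
wal_segment_name() {
    local tli=$1 segno=$2
    printf '%08X%08X%08X\n' "$tli" $((segno / 256)) $((segno % 256))
}

wal_segment_name 1 1   # segment 1 on timeline 1
wal_segment_name 2 1   # the same segment on timeline 2
```

The two names differ only in the leading eight hex digits (the timeline), which is why both files can exist for one segment, one in the archive and one in pg_xlog.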

This occurs while fetching the checkpoint redo record in archive
recovery.

> if (checkPoint.redo < RecPtr)
> {
>     /* back up to find the record */
>     record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);

It is caused by the segment file for the older timeline in the
archive directory being preferred to the one for the newer
timeline in pg_xlog.

Looking into pg_xlog before trying the older TLIs in the archive,
as 9.2 and earlier do, fixes this issue. The attached patch is one
possible solution for 9.4dev.
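
The difference between the two lookup orders can be sketched with plain files. This is only a simplified model of the behaviour described above, not the server's code; the function name find_segment, its mode names and the temporary directory layout are all assumptions made for illustration.

```shell
#!/bin/bash
# Simplified model of the segment lookup; NOT the server's actual code.
# find_segment MODE ARCHDIR XLOGDIR SEG TLI...
#   MODE=archive_only : for each TLI (newest first) look in the archive only
#   MODE=any          : for each TLI look in the archive, then in pg_xlog
# TLIs are 8-hex-digit strings, SEG is the 16-hex-digit name suffix.
find_segment() {
    local mode=$1 archdir=$2 xlogdir=$3 seg=$4
    shift 4
    local tli
    for tli in "$@"; do
        if [ -f "$archdir/$tli$seg" ]; then
            echo "archive:$tli$seg"
            return 0
        fi
        if [ "$mode" = any ] && [ -f "$xlogdir/$tli$seg" ]; then
            echo "pg_xlog:$tli$seg"
            return 0
        fi
    done
    return 1
}

# Model the problematic state: timeline 1's segment is in the archive,
# timeline 2's segment exists only in pg_xlog.
demo=$(mktemp -d)
mkdir "$demo/arc" "$demo/pg_xlog"
touch "$demo/arc/000000010000000000000001" \
      "$demo/pg_xlog/000000020000000000000001"

find_segment archive_only "$demo/arc" "$demo/pg_xlog" 0000000000000001 00000002 00000001
find_segment any          "$demo/arc" "$demo/pg_xlog" 0000000000000001 00000002 00000001
```

In this model the archive-only order falls through to the timeline 1 file in the archive, while the other order stops at the timeline 2 file in pg_xlog, which is the file recovery actually needs.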

Attached files are,
- recvtest.sh: Reproduction script. Stages 1 and 2 set up the
  condition and stage 3 triggers the issue.
- archrecvfix_20131212.patch: The patch that fixes the issue. With
  it, archive recovery reads pg_xlog before trying older TLIs in
  the archive, as 9.2 and earlier do.
 

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
#!/bin/bash

ROOT=`pwd`
PGDATA=$ROOT/test/data
ARCHDIR=$ROOT/test/arc

if [ ! -d $ARCHDIR -o ! -d $PGDATA ]; then
  echo "$PGDATA and/or $ARCHDIR not found"
  exit
fi

echo "### EMPTY ARCHIVE DIRECTORY ###"
if [ -d $ARCHDIR ]; then rm -f $ARCHDIR/*; fi

echo "### EMPTY PGDATA DIRECTORY ###"
if [ -d $PGDATA ]; then rm -r $PGDATA/*; fi

echo "### DO INITDB ###"
initdb -D $PGDATA > /dev/null

echo "### set up postgresql.conf ###"
cat >> $PGDATA/postgresql.conf <<EOF
wal_level = archive
archive_mode = on
archive_command = '/bin/cp %p $ARCHDIR/%f'
# log_min_messages = debug5
EOF

echo "### STAGE 1/3 -- PUT XLOG ..001...001 AND ..002.HISTORY INTO ARCHIVE ###"
echo "### STAGE 1/3: 1/2 START SERVER ###"
pg_ctl start -D $PGDATA -w

echo "### STAGE 1/3: 2/2 STOP SERVER ###"
pg_ctl stop -D $PGDATA

echo "### STAGE 2/3 -- PUT XLOG ..002...001 INTO ONLY pg_xlog ###"
echo "### STAGE 2/3: 1/3 PREPARE recovery.conf ###"
cat > $PGDATA/recovery.conf <<EOF
restore_command = '/bin/cp $ARCHDIR/%f %p'
EOF

echo "### STAGE 2/3: 2/3 START SERVER IN ARCHIVE RECOVERY MODE ###"
pg_ctl start -D $PGDATA -w

echo "### STAGE 2/3: 3/3 STOP SERVER IMMEDIATELY ###"
pg_ctl stop -m i -D $PGDATA

echo "### ls $ARCHDIR"
ls $ARCHDIR

echo "### ls $PGDATA/pg_xlog"
ls $PGDATA/pg_xlog


echo "### STAGE 3/3 - START SERVER IN ARCHIVE RECOVERY MODE AGAIN ###"
echo "### STAGE 3/3: 1/2 RESTORE recovery.conf ###"
mv $PGDATA/recovery.done $PGDATA/recovery.conf

echo "### STAGE 3/3: 2/2 START SERVER IN ARCHIVE RECOVERY MODE 2ND RUN ###"
pg_ctl start -D $PGDATA -w -t 2
if [ $? -ne 0 ]; then
  echo "### SERVER CRASHED ###"
  exit
fi
echo "### SERVER SEEMS SUCCESSFULLY UP. STOP IT. ###"
pg_ctl stop -D $PGDATA -w
echo  "### SERVER DID NOT CRASH ###"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6fa5479..75be478 100755
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10935,10 +10935,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                     curFileTLI = 0;
 
                 /*
-                 * Try to restore the file from archive, or read an existing
-                 * file from pg_xlog.
+                 * When XLOG_FROM_ARCHIVE, read xlog file with largest TLI
+                 * preferring archive to pg_xlog. Or when XLOG_FROM_PG_XLOG,
+                 * search only pg_xlog.
                  */
-                readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, currentSource);
+                readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
+                                    currentSource == XLOG_FROM_ARCHIVE ?
+                                    XLOG_FROM_ANY : currentSource);
                 if (readFile >= 0)
                     return true;    /* success! */
 
