Hello, we happened to see a server crash during archive recovery under certain conditions.
After the TLI is incremented, it can happen that the WAL file for the older timeline has been archived, but the file with the same segment id for the newer timeline has not. Archive recovery fails in that case with a PANIC error like the following,
| PANIC: record with zero length at 0/1820D40
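To make "same segment id, different timeline" concrete, here is a minimal sketch of how a segment file name is built from a TLI and an LSN. It assumes the default 16MB segment size; wal_file_name is a hypothetical helper written for this mail, not a PostgreSQL function, and TLIs 1 and 2 are only example values.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: default 16MB WAL segments. */
#define WAL_SEG_SIZE      UINT64_C(0x1000000)
#define SEGS_PER_XLOG_ID  (UINT64_C(0x100000000) / WAL_SEG_SIZE)

/*
 * Hypothetical helper: format the 24-character segment file name
 * (8 hex digits each for TLI, log id, and segment within the log id)
 * of the segment containing lsn on timeline tli.
 * buf must have room for at least 25 bytes.
 */
void
wal_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t lsn)
{
    uint64_t segno = lsn / WAL_SEG_SIZE;

    snprintf(buf, buflen, "%08X%08X%08X",
             (unsigned) tli,
             (unsigned) (segno / SEGS_PER_XLOG_ID),
             (unsigned) (segno % SEGS_PER_XLOG_ID));
}
```

Taking the LSN from the PANIC message above (0/1820D40) as an example, timeline 1 gives 000000010000000000000001 and timeline 2 gives 000000020000000000000001. The last 16 digits are identical, so the two files cover the same WAL range and differ only in the timeline part.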
A replay script is attached. This issue occurs on 9.4dev and 9.3.2, but not on 9.2.6 and 9.1.11. The latter search pg_xlog for a TLI before trying the archive for older TLIs.
This occurs while fetching the checkpoint redo record during archive recovery.
> if (checkPoint.redo < RecPtr)
> {
>     /* back up to find the record */
>     record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);
This is caused by the segment file for the older timeline in the archive directory being preferred over the one for the newer timeline in pg_xlog.
Looking into pg_xlog before trying the older TLIs in the archive, as 9.2 and earlier do, fixes this issue. The attached patch is one possible solution for 9.4dev.
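The difference between the two search orders can be shown with a toy model. This is only an illustration of the orders described above, not PostgreSQL code: SegAvail and both pick_tli_* functions are hypothetical names, and the model ignores everything except where a segment file exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model: for one segment id, where a file exists for each
 * timeline. Entries must be ordered from newest TLI to oldest.
 */
typedef struct
{
    unsigned tli;
    bool     in_archive;
    bool     in_pg_xlog;
} SegAvail;

/*
 * Problematic order: the archive is tried for every TLI first, so an
 * archived file of an older timeline wins over a newer one in pg_xlog.
 * Returns the chosen TLI, or 0 if no file is found.
 */
unsigned
pick_tli_archive_first(const SegAvail *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (s[i].in_archive)
            return s[i].tli;
    for (size_t i = 0; i < n; i++)
        if (s[i].in_pg_xlog)
            return s[i].tli;
    return 0;
}

/*
 * Order that avoids the problem: both locations are checked for a TLI
 * before moving on to an older TLI.
 */
unsigned
pick_tli_per_timeline(const SegAvail *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (s[i].in_archive || s[i].in_pg_xlog)
            return s[i].tli;
    return 0;
}
```

With the condition from this report, that is, TLI 2 present only in pg_xlog and TLI 1 present only in the archive, the first function picks TLI 1 and the second picks TLI 2.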
Attached files are,
- recvtest.sh: Replay script. Steps 1 and 2 set up the condition and step 3 triggers the issue.
- archrecvfix_20131212.patch: The patch that fixes the issue. With this patch, archive recovery reads pg_xlog before trying older TLIs in the archive, similarly to 9.1-.
regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center