Recovery vs. RelationTruncate(); skipFsync vs. unlogged rels
From: Noah Misch
Subject: Recovery vs. RelationTruncate(); skipFsync vs. unlogged rels
Date:
Msg-id: 20191130060514.GA108490@gust.leadboat.com
List: pgsql-hackers
While reviewing the patches in http://postgr.es/m/20191126.213752.2132434859202124793.horikyota.ntt@gmail.com I noticed three related problems.  The defects motivating that other thread are specific to wal_level=minimal.  The related problems are not specific to any wal_level, hence my starting a separate thread.

1. We don't sync unlogged tables after skipFsync=true operations.

smgrwrite() explains the contract for skipFsync=true.  Concretely, it causes md.c to skip enqueueing the relation to be synced during the next checkpoint.  Three places in the tree call smgrextend(..., skipFsync=true) and/or smgrwrite(..., skipFsync=true):

  rewriteheap.c (CLUSTER, VACUUM FULL)
  _bt_blwritepage() (CREATE INDEX)
  RelationCopyStorage() (ALTER TABLE SET TABLESPACE)

They use logic like this to decide when to sync:

	/*
	 * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
	 * to ensure that the toast table gets fsync'd too.  (For a temp or
	 * unlogged rel we don't care since the data will be gone after a crash
	 * anyway.)
	 *
	 * It's obvious that we must do this when not WAL-logging the copy.  It's
	 * less obvious that we have to do it even if we did WAL-log the copied
	 * pages.  The reason is that since we're copying outside shared buffers, a
	 * CHECKPOINT occurring during the copy has no way to flush the previously
	 * written data to disk (indeed it won't know the new rel even exists).  A
	 * crash later on would replay WAL from the checkpoint, therefore it
	 * wouldn't replay our earlier WAL entries.  If we do not fsync those pages
	 * here, they might still not be on disk when the crash occurs.
	 */
	if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
		smgrimmedsync(dst, forkNum);

However, the reasoning about unlogged rels is incomplete.  Normally, we sync unlogged rels during each checkpoint after we wrote out a dirty buffer, just as we do for permanent rels.  (It would be enough to sync them during the shutdown checkpoint.  That would require a data structure to track unlogged relation files dirtied since StartupXLOG().)  However, due to the code above, we miss syncing them if one of these DDL operations is the last operation changing the unlogged rel before a shutdown checkpoint.  I've attached a test-only patch demonstrating the problem.  This can cause unlogged tables to have invalid contents if an OS crash happens with PostgreSQL stopped.  I think the fix is simple: test "relpersistence != RELPERSISTENCE_TEMP" instead.

2. RelationTruncate() needs defense against concurrent checkpoints.

In RelationTruncate(), nothing stops a checkpoint from starting and completing after XLogInsert(... XLOG_SMGR_TRUNCATE) and before smgrtruncate().  That checkpoint could move the redo pointer past the XLOG_SMGR_TRUNCATE record, making the outcome equivalent to having never written the WAL record.  If the OS crashes before the first post-smgrtruncate() checkpoint, the filesystem may forget some or all ftruncate() calls.  The symptom would be unexpected tuples in the relation.  Example test procedure:

  BEGIN;
  CREATE TABLE t (c) AS SELECT 1;
  -- set breakpoint at XLogFlush()
  TRUNCATE t;
  -- hit breakpoint
  -- while stopped at breakpoint, issue CHECKPOINT, then release debugger
  COMMIT;
  -- hard crash forgets some ftruncate()
  pg_ctl -w start  # REDO does not reissue ftruncate()

The fix is to set delayChkpt before writing WAL and clear it after smgrtruncate(), like we do in EndPrepare().

3. smgrtruncate()/mdtruncate() is not qualified to be called in recovery.

smgr_redo() calls smgrtruncate() to replay XLOG_SMGR_TRUNCATE records.  However, mdtruncate() relies on the following md.c invariant:

 *	On disk, a relation must consist of consecutively numbered segment
 *	files in the pattern
 *		-- Zero or more full segments of exactly RELSEG_SIZE blocks each
 *		-- Exactly one partial segment of size 0 <= size < RELSEG_SIZE blocks
 *		-- Optionally, any number of inactive segments of size 0 blocks.

That invariant does not hold before reachedConsistency.  Suppose the unclean shutdown happened after the original (non-recovery) mdtruncate() and before anything synced the truncated segments.  The OS might remember the ftruncate() of segment zero and lose the ftruncate() of segment one, upsetting the invariant.  The symptom would again be unexpected tuples in the relation.  My preferred fix is to make mdtruncate() truncate each inactive segment, stopping when opening a segment fails with ENOENT, like mdunlinkfork() does.  An alternative would be to add a last_segment_truncated field to xl_smgr_truncate, then truncate up to that segment.

Does anyone see a way to improve on those proposed fixes?

Thanks,
nm
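To make the mdunlinkfork()-style stopping rule concrete, here is a standalone C sketch.  This is not PostgreSQL code: the segment size, file layout, and function names (truncate_rel, seg_name, SEG_SIZE) are invented for illustration.  It truncates the segment containing the cut point, then keeps truncating higher-numbered segments to zero until open() fails with ENOENT, so it does not depend on the on-disk invariant holding:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define SEG_SIZE 8192           /* stand-in for RELSEG_SIZE * BLCKSZ */

/* Build the file name of segment "segno" of the relation at "path". */
static void
seg_name(char *buf, size_t len, const char *path, int segno)
{
	if (segno == 0)
		snprintf(buf, len, "%s", path);
	else
		snprintf(buf, len, "%s.%d", path, segno);
}

/*
 * Truncate the relation at "path" to "nbytes" total bytes.  Rather than
 * trusting segment sizes, keep truncating successive segments until
 * opening the next one fails with ENOENT, the way mdunlinkfork()
 * terminates its unlink loop.
 */
static int
truncate_rel(const char *path, off_t nbytes)
{
	int		segno = (int) (nbytes / SEG_SIZE);
	off_t	keep = nbytes % SEG_SIZE;
	char	name[1024];

	for (;; segno++, keep = 0)
	{
		int		fd;

		seg_name(name, sizeof(name), path, segno);
		fd = open(name, O_WRONLY);
		if (fd < 0)
		{
			if (errno == ENOENT)
				return 0;		/* ran past the last segment: done */
			return -1;
		}
		if (ftruncate(fd, keep) < 0)
		{
			close(fd);
			return -1;
		}
		close(fd);
	}
}
```

Applied in replay, a loop of this shape makes the truncation idempotent even when the OS preserved only some of the original ftruncate() calls, because it never assumes segments after the cut point are already zero-length or absent.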
Attachments